Preface



The Second International Conference on Data Warehousing and Knowledge
Discovery (DaWaK 2000) was held in Greenwich, UK, on 4-6 September. DaWaK 2000
was a forum where researchers from data warehousing and knowledge discovery 
disciplines could exchange ideas on improving next generation decision support and 
data mining systems. 

The conference focused on the logical and physical design of data warehousing and 
knowledge discovery systems. The scope of the papers covered the most recent and 
relevant topics in the areas of data warehousing, multidimensional databases, OLAP, 
knowledge discovery and mining complex databases. These proceedings contain the 
technical papers selected for presentation at the conference. 

We received more than 90 papers from over 20 countries, and the program committee
selected 31 long papers and 11 short papers. The conference program included
three invited talks, namely, “A Foolish Consistency: Technical Challenges in 
Consistency Management” by Professor Anthony Finkelstein, University College 
London, UK; “European Plan for Research in Data Warehousing and Knowledge 
Discovery” by Dr. Harald Sonnberger (Head of Unit A4, Eurostat, European 
Commission); and “Security in Data Warehousing” by Professor Bharat Bhargava, 
Purdue University, USA. 

We would like to thank the DEXA 2000 workshop general chair (Professor Roland 
Wagner) and the organizing committee of the 11th International Conference on 
Database and Expert Systems Applications (DEXA 2000) for their support and 
cooperation. Many thanks are due to Ms Gabriela Wagner for providing a great
deal of help and assistance. We are indebted to all program committee members
and outside reviewers, who reviewed the papers carefully and promptly. We
would also like to thank all the authors who submitted their papers to DaWaK 2000;
they provided us with an excellent technical program.

Yahiko Kambayashi, General Chair

Mukesh Mohania and A Min Tjoa, Program Committee Chairs 
Tok Wang Ling, Panel Chair 
The Design and Development of a Logical
System for OLAP* 



Luca Cabibbo and Riccardo Torlone 

Dipartimento di Informatica e Automazione, Università di Roma Tre
Via della Vasca Navale, 79, I-00146 Roma, Italy
{cabibbo,torlone}@dia.uniroma3.it



Abstract. We report on the design and development of a novel archi- 
tecture for data warehousing. This architecture adds a “logical” level of 
abstraction to the traditional data warehousing framework, to guarantee 
an independence of OLAP applications from the physical storage struc- 
ture of the data warehouse. This property allows users and applications 
to manipulate multidimensional data ignoring implementation details. 
Also, it supports the integration of multidimensional data stored in
heterogeneous OLAP servers. We propose MD, a simple data model for
multidimensional databases, as the reference for the logical layer. We 
then describe the design of a system, called MDS, that supports the 
above logical architecture. 



1 Introduction 

Current technology provides many software tools supporting data
warehousing [3-6]. Apart from a number of facilities implementing specific
activities, the main components of a data warehousing system are the data
warehouse server and the front-end clients, often called OLAP tools. Data
warehouse servers can be relational (ROLAP), multidimensional (MOLAP), or
hybrid systems. ROLAP systems store a data warehouse in relational tables and
map multidimensional operations to extended SQL statements. A quite standard
organization of a data warehouse in a ROLAP system is the star scheme (or a
variant thereof) [5], with a central table representing the fact on which the
analysis is focused, and a number of tables, only partially normalized,
representing the dimensions of analysis (e.g., time, location, type).
Conversely, MOLAP systems are special servers that directly store and
manipulate data in proprietary multidimensional formats. In hybrid OLAP
systems, data can be stored in both relational and multidimensional form.
Front-end tools offer powerful querying and reporting capabilities, usually
based on interactive graphical user interfaces similar to spreadsheets.
Standard Application Program Interfaces (APIs) for communication between data
warehouse server and clients are also emerging (e.g., OLE DB for OLAP [7] and
MD-API [6]).

Although the available tools are powerful and very useful, we believe that 
there is a limitation in the traditional organization of a data warehousing system. 

* This work was partially supported by IASI, by CNR and by MURST. 



Y. Kambayashi, M. Mohania, and A M. Tjoa (Eds.): DaWaK 2000, LNCS 1874, pp. 1-10, 2000. 
© Springer-Verlag Berlin Heidelberg 2000 







The problem is that the way in which information is viewed and manipulated
in an OLAP tool strictly depends on how data is stored in the server. Several
OLAP tools somehow filter the underlying technology by providing a high-level
interface, but they are usually customized to work with a specific data
warehouse server or class of servers (relational or multidimensional). Also,
and more importantly, if the organization of the data warehouse changes, there
is often the need to rebuild the front-end views of the data warehouse. For
example, if an OLAP tool provides a multidimensional view of a relational star
scheme in the form of a spreadsheet and the scheme is just normalized (or
denormalized) for efficiency reasons, the view is no longer valid and needs to
be redefined. Another consequence of the dependency of OLAP applications on
their implementation is that the integration of data coming from different
data warehousing systems is difficult to achieve, since no standardization
exists in this context.

The goal of this paper is to illustrate a new architecture for data warehous- 
ing aimed at solving this problem. The basic idea is to add an explicit level 
of abstraction to the traditional data warehousing framework. The new level, 
which we call “logical” , serves to guarantee an independence of OLAP applica- 
tions from the physical storage structure of the data warehouse. This is similar 
to what happens with relational technology, in which the property of data inde- 
pendence allows users and applications to manipulate tables and views ignoring 
implementation details. 

We make use of a simple data model, called MD, as the reference for the
logical level. This model provides a number of constructs used to describe, in an 
abstract but natural way, the basic notions that can be found in almost every 
OLAP system (fact, dimension, level of aggregation, and measure). 

We then describe the design of a system, called MDS, that supports the 
above logical architecture. MDS is able to view an existing data warehouse, 
which can be stored in a variety of systems, in terms of the MD model and to
map queries and data between the two frameworks. By providing a practical 
example of use of MDS, we show how the architecture we propose can make a 
data warehouse more effective and easier to use. 

The rest of the paper is organized as follows. In Sect. 2 we give a general 
overview of the approach. Sect. 3 is devoted to the presentation of the system 
we have designed and, in Sect. 4, the most important component of this system 
is illustrated in detail with reference to a specific but important case. 

2 A Logical Approach to Data Warehousing 

According to Inmon [4], the traditional architecture for data warehousing
comprises four different levels, as follows:

— Operational layer. The operational sources of data, which are often hetero- 
geneous and distributed, from which the data warehouse is populated by 
means of cleaning, filtering, and integration operations. 

— Data Warehouse layer. The data warehouse lies at this level and is kept 
separate from the operational sources. Data is represented here either in 
relational form (ROLAP systems) or in multidimensional proprietary form
(MOLAP systems). 

— Data mart layer. Each data mart is generally built from a selected portion of 
the data warehouse for studying a specific, well-bounded business problem. 

— Individual layer. Data is presented to analysts for interpretation. Various 
types of graphical representations are used here. 

Variants of this organization are possible. In particular, either the data ware- 
house or the data mart layer can be absent. However, there is always an interme- 
diate level of persistent data between the operational and the individual layers; 
we will refer to this level as the data warehouse level hereinafter.

As we have said in the Introduction, a source of anomalies in the traditional
architecture is that the way in which data is viewed and manipulated at the 
individual level strictly depends on the data warehouse level. As a consequence, 
changes to the organization of the data warehouse are propagated to the views 
defined on it, and thus to the external level. Moreover, the interoperability be- 
tween different data warehouses must be solved case by case. 

A possible solution to alleviate this problem is the introduction of a “logical” 
level in the data warehousing architecture. The new level serves to hide details of 
the actual implementation of the data warehouse and to concentrate on describ- 
ing, in abstract terms, the basic, multidimensional aspects of data employed in 
their analysis. Specifically, the data warehouse level is split into a logical layer,
which describes the content of the data warehouse in multidimensional but
abstract terms, and an internal layer, which describes how the data warehouse is
implemented (relational or multidimensional structures and their organization).
Queries and views are defined with respect to the logical data warehouse layer.
According to this organization, data warehouse restructuring occurs at the
internal level and just requires the modification of the mapping between the internal
and the logical layer. The logical representation of the data warehouse always
remains the same, and so do the queries and views defined on it.

We believe that this architecture has a number of further advantages: (i)
analysts can manipulate data more easily since they can refer to a high-level
representation of data; (ii) interoperability between different data warehousing
systems is facilitated, since the new level provides an abstract, and thus
unifying, description of data; (iii) the system can scale up easily, for instance
when new data sources are added, since this corresponds to a simple extension of
the data warehouse description at the logical level.

The MD Data Model and Algebra. The MultiDimensional data model [1]
(MD for short) is based on a few constructs modeling the basic concepts that can
be found in any OLAP system. An MD dimension is a collection of levels and
corresponds to a business perspective under which the analysis can be performed.
The levels of a dimension are data domains at different granularity and are
organized into a hierarchy. Within a dimension, values of different levels are
related through a family of roll-up functions, according to the hierarchy defined
on them. A roll-up function $\text{R-UP}_{l}^{l'}$ (also denoted by $\rightarrow$) associates a value $v$ of
F-tables

Sales[Period : day, Product : item, Location : store] $\rightarrow$ [NSales : numeric, Income : numeric]

CostOfItem[Product : item, Month : month] $\rightarrow$ [Cost : numeric]



Fig. 1. The sample MD scheme Retail



a level $l$ with a value $v'$ of an upper level $l'$ in the hierarchy. In this case we say
that $v$ rolls up to $v'$ and also that $l$ rolls up to $l'$. The main construct of the MD
model is the f-table: this is a (partial) function that associates a collection of
level values, called symbolic coordinate, with one or more measures. Components
of coordinates are also called attributes. An entry of an f-table $f$ is a coordinate
over which $f$ is defined. Thus, an f-table is used to represent factual data on
which the analysis is focused (the measures) and the perspective of analysis (the
coordinate).

As an example, consider the MD scheme Retail, shown in Fig. 1. This
scheme can be used by a marketing analyst of a chain of toy stores and is
organized along dimensions time, product, and location (shown on top of the
figure). The time dimension is organized in a hierarchy of levels involving day,
month, quarter, and year. Similarly, the location dimension is based on a
hierarchy of levels involving store, city, and area. Finally, the product dimension
contains levels item, category, and brand. There are two further atomic
dimensions (that is, having just one level) that are used to represent numeric values
and strings. In this framework, we can define the f-tables Sales and
CostOfItem. The former describes summary data for the sales of the chain, organized
along dimensions time (at day level), product (at item level), and location (at
store level). The measures for this f-table are NSales (the number of items sold)
and Income (the gross income), both having type numeric. The f-table
CostOfItem is used to represent the costs of the various items, assuming that costs
may vary from month to month.
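
The constructs just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: roll-up functions are modelled as dictionaries, f-tables as dictionaries from coordinates to measures, and every level value (dates, store codes, city names, the item "robot") is invented sample data that does not come from the paper.

```python
# Roll-up functions of the Retail scheme, one dictionary per pair of levels.
# All level values are invented sample data.
ROLL_UP = {
    ("day", "month"): {"2000-01-05": "2000-01", "2000-01-20": "2000-01"},
    ("store", "city"): {"S1": "Rome", "S2": "Milan"},
    ("city", "area"): {"Rome": "Centre", "Milan": "North"},
}


def roll_up(value, lower, upper):
    """Apply the roll-up function from level `lower` to level `upper`."""
    return ROLL_UP[(lower, upper)][value]


# An f-table is a partial function: symbolic coordinate -> measures.
# Sales[Period : day, Product : item, Location : store] -> [NSales, Income]
sales = {
    ("2000-01-05", "robot", "S1"): {"NSales": 3, "Income": 60.0},
    ("2000-01-20", "robot", "S2"): {"NSales": 1, "Income": 20.0},
}
# CostOfItem[Product : item, Month : month] -> [Cost]
cost_of_item = {("robot", "2000-01"): {"Cost": 12.0}}

# Roll-up functions compose along the hierarchy: store -> city -> area.
area_of_s1 = roll_up(roll_up("S1", "store", "city"), "city", "area")
```

Because an f-table is partial, a coordinate with no sales simply has no entry in `sales`; nothing forces every combination of day, item, and store to be present.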

In [2] we have defined a number of query languages for MD according to
different needs: (i) a graphical and easy-to-use language for end-users; (ii) a
declarative and powerful language (based on a calculus) for experts; and (iii) a
procedural language (based on an algebra) suitable for optimization and
directly implementable. We now briefly recall the algebra for MD since it is
used by the system as an intermediate language whose operators are directly
implemented or are translated into statements of a language (e.g., SQL or API)
for an existing database server (a relational, a ROLAP, or a MOLAP system).

The MD algebra [2], apart from a number of operators that are variants
of relational algebra operators, comprises three new operators. Two of them
allow the application of scalar ($\varphi$) and aggregative ($\psi$) functions on attributes
of an f-table. The other operator ($\varrho$) is specific to the MD model: it allows the
application of a roll-up function to an f-table.

As an example, assume that we need to construct from the Retail database 
(Fig. 1) the f-table West-Profit storing, for each store and month, the net 
revenue of the stores in the west area. The net revenue is to be calculated as the 
difference between total income and cost of the items sold. 

This query can be specified by means of the following MD algebra expression.

$$\psi^{\,netRevenue = sum(dr)}_{Product,\,Month}\Big(\varphi_{dr = income - nsales * cost}\big(\text{CostOfItem} \bowtie \varrho^{month}_{period:day}(\sigma_{area = West}(\varrho^{area}_{location:store}(\text{Sales})))\big)\Big)$$



The inner roll-up operator $\varrho$ in the expression extends the coordinates of the
f-table Sales with a new attribute area corresponding to the store of each
sale. The $\sigma$ operator is then used to select only the stores in the west area.
The outer operator $\varrho$ further extends the coordinates of the result with a new
attribute month corresponding to the day of each sale. The equi-join operator
$\bowtie$ is then used to combine the f-table so obtained and the f-table CostOfItem,
according to the values over the attributes Product and Month (we recall that
in our example the costs of items are stored on a monthly basis). The operator
$\varphi$ introduces into the result f-table the measure dr (the daily revenue) computed
from the other measures by the specified scalar function. Finally, the aggregate
operator $\psi$ computes the sum of the measure dr of the f-table we have obtained,
grouping the result by the attributes Product and Month.
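
To make the six steps concrete, the following Python sketch evaluates the same pipeline over a handful of invented entries. It is an illustration only: f-tables are represented as lists of dictionaries, the roll-up functions are a lookup table and a string slice, and all data values are assumptions rather than material from the paper.

```python
from collections import defaultdict

# Invented sample data; attribute names mirror the Retail scheme.
STORE_AREA = {"S1": "West", "S2": "East"}      # roll-up store -> area
sales = [
    {"period": "2000-01-05", "product": "robot", "location": "S1",
     "nsales": 3, "income": 60.0},
    {"period": "2000-01-20", "product": "robot", "location": "S1",
     "nsales": 2, "income": 40.0},
    {"period": "2000-01-20", "product": "robot", "location": "S2",
     "nsales": 5, "income": 100.0},
]
cost_of_item = [{"product": "robot", "month": "2000-01", "cost": 12.0}]


def day_to_month(day):
    """Roll-up day -> month for ISO dates (YYYY-MM-DD)."""
    return day[:7]


# 1. Inner roll-up: extend each coordinate with the area of the store.
step1 = [dict(e, area=STORE_AREA[e["location"]]) for e in sales]
# 2. Selection: keep only the stores in the west area.
step2 = [e for e in step1 if e["area"] == "West"]
# 3. Outer roll-up: extend each coordinate with the month of the day.
step3 = [dict(e, month=day_to_month(e["period"])) for e in step2]
# 4. Equi-join with CostOfItem on product and month.
step4 = [dict(e, cost=c["cost"]) for e in step3 for c in cost_of_item
         if e["product"] == c["product"] and e["month"] == c["month"]]
# 5. Scalar function: dr = income - nsales * cost.
step5 = [dict(e, dr=e["income"] - e["nsales"] * e["cost"]) for e in step4]
# 6. Aggregation: sum of dr, grouped by product and month.
west_profit = defaultdict(float)
for e in step5:
    west_profit[(e["product"], e["month"])] += e["dr"]
```

With this data the two western entries contribute 24.0 and 16.0, so `west_profit` holds a single entry of 40.0 for the robot in January 2000.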

We have shown that the MD algebra is equivalent to the MD calculus and more
powerful than the MD graphical language. It is therefore possible to
automatically translate graphical and textual queries into algebraic expressions.



3 Design of the System Architecture 

On the basis of the issues discussed in previous sections, we have designed a 
tool for data warehousing, called MDS (MultiDimensional System), whose over- 
all architecture is shown in Fig. 2. (In the figure, a box represents a software 
module, a cylinder represents a repository of data, and a line denotes a flow of 
information.) The whole environment is loosely coupled with one or more data 
warehouses managed by external storage systems. These systems can be OLAP 
servers (relational or multidimensional) and/or database management systems. 
The only hypothesis we make is that the basic multidimensional aspects of data 
analysis can be identified in the data warehouse representation provided by each 
system. 
Fig. 2. The MDS system environment

The Data Warehouse Interface (DWI) performs two important functions:
(i) it realizes the mapping between an MD database scheme and the actual
organization of a data warehouse, and (ii) it converts MD queries into operations
understandable by the underlying data warehouse server.

The DWI Wizard supports the MDS Manager with the construction of a
DWI. This tool is separated from MDS but integrated with it. The DWI Wizard
manages a library of DWI templates that refer to a specific database server but to
a generic database. A DWI template can be customized, using the DWI Wizard,
by specifying the actual scheme of a data warehouse and its correspondence with
an MD database.

The MDS Manager is the core of the system and performs several tasks: (i)
it imports the scheme of a data warehouse from a data storage system through
the corresponding DWI and stores it into a local MD Data Dictionary; (ii) it
receives MD queries (in an algebraic form) against a stored MD scheme, and
sends them to the DWIs for execution; (iii) it receives the result of a query from
one or more DWIs, possibly combines them, and then gives the final result to the
Scheme Manager for its visualization; (iv) it receives the request for a creation
of a view and stores its definition in the MD Data Dictionary. Query evaluation
is supported by the management of a local cache of MD data.

The Scheme Manager requests the loading and translation into the MD
model of an external data warehouse scheme, transfers the scheme of an available
MD database to the User Interface for visualization, and allows the definition
of views on it.
Relations

$R_T(Date, Month, Quarter)$    $R_{T'}(Quarter, Year)$

$R_P(Item, Category, Brand)$    $R_L(Store, Address, City, Area)$

$R_S(Item, Date, Store, Nsales, Revenue)$    $R_C(Item, Month, Cost)$

Functional dependencies (besides keys)

$R_T.Month \xrightarrow{FD} R_T.Quarter$,    $R_L.City \xrightarrow{FD} R_L.Area$

Foreign keys

$R_S.Store \xrightarrow{FK} R_L$,    $R_S.Item \xrightarrow{FK} R_P$,    $R_S.Date \xrightarrow{FK} R_T$

$R_C.Item \xrightarrow{FK} R_P$,    $R_C.Month \xrightarrow{FK} R_T.Month$

$R_T.Quarter \xrightarrow{FK} R_{T'}$



Fig. 3. Example of the RelRetail data warehouse scheme

The Visual query composer supports the user in the specification of MD
visual queries and verifies their correctness. The Textual query parser accepts
queries written in the MD textual language. Both components transform MD
queries into an internal format. The query Compiler translates visual and textual
MD queries (specified according to the internal representation) into MD algebra
expressions. The compiler also performs some optimization, mainly by pushing
down selections and projections. The User Interface allows the interaction with
the system through both textual and graphical tools. The work of the user is
supported by a number of menus and forms.



4 Data Warehouse Interfaces 

Let M be the model adopted by a data warehouse server DS. Basically, a DWI 
for DS has three components: 

1. a mapping $\theta$, which associates concepts of a data warehouse scheme $S$ of
M with concepts of an MD scheme $\mathcal{S}$;

2. a Query Translation Algorithm (QTA), which converts a query over an MD
scheme into a query over a scheme of M that DS can understand; and

3. a Data Translation Algorithm (DTA), which transforms the results of queries
and views over the scheme $S$ into MD instances.

Both algorithms used in a DWI make use of the mapping $\theta$ and depend only
on the features of DS. Therefore, once DS, and thus M, has been fixed, the only
component that needs to be changed in the evolution of the data warehouse is the
mapping $\theta$. The idea of DWI template is actually based on this consideration:
we can fix the algorithms and obtain a new DWI for a scheme of M by just
modifying $\theta$ whenever the organization of the data warehouse changes.
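
The template idea can be illustrated with a small Python sketch. Everything in it is hypothetical: the class `DWI`, the functions `toy_qta`, `toy_dta`, and `make_template`, the table names, and the drastically simplified notion of a "query" (just a level name) are assumptions made for illustration, not the interfaces of MDS.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class DWI:
    """A mapping plus two algorithms that are fixed for one kind of server."""
    mapping: Dict[str, str]                                   # MD level -> "Table.Column"
    translate_query: Callable[[str, Dict[str, str]], str]     # toy QTA
    translate_data: Callable[[dict, Dict[str, str]], dict]    # toy DTA


def toy_qta(level, mapping):
    """Toy QTA: list the values of one MD level."""
    table, column = mapping[level].split(".")
    return f"SELECT DISTINCT {column} FROM {table}"


def toy_dta(row, mapping):
    """Toy DTA: rename relational columns back to MD levels."""
    inverse = {target.split(".")[1]: level for level, target in mapping.items()}
    return {inverse[column]: value for column, value in row.items()}


def make_template(qta, dta):
    """Fix the algorithms; a concrete DWI is obtained by supplying a mapping."""
    return lambda mapping: DWI(mapping, qta, dta)


relational_template = make_template(toy_qta, toy_dta)
dwi_v1 = relational_template({"city": "R_L.City"})
dwi_v2 = relational_template({"city": "R_City.Name"})   # after a restructuring

q1 = dwi_v1.translate_query("city", dwi_v1.mapping)
q2 = dwi_v2.translate_query("city", dwi_v2.mapping)
row = dwi_v2.translate_data({"Name": "Rome"}, dwi_v2.mapping)
```

The logical request ("city") is identical before and after the restructuring; only the mapping handed to the template differs, which is the point of the paragraph above.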

To give more insight into the structure of a DWI and the way
in which it works, we next illustrate an example that refers to a specific but
important case. Specifically, we describe a possible DWI for a relational data 
warehouse organized according to (variants of) the star scheme. This is indeed 
a quite standard representation of a data warehouse in a ROLAP server. 
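Dimensions

Product $= \{ (item) \leftrightarrow (R_P.Item),\ (category) \leftrightarrow (R_P.Category),\ (brand) \leftrightarrow (R_P.Brand) \}$

Location $= \{ (store) \leftrightarrow (R_L.Store),\ (city) \leftrightarrow (R_L.City),\ (area) \leftrightarrow (R_L.Area) \}$

Time $= \{ (day) \leftrightarrow (R_T.Date),\ (month) \leftrightarrow (R_T.Month),\ (quarter) \leftrightarrow (R_T.Quarter),\ (year) \leftrightarrow (R_{T'}.Year) \}$

Roll-up functions

$(item \rightarrow category) \leftrightarrow (R_P.Item \rightarrow R_P.Category)$,    $(item \rightarrow brand) \leftrightarrow (R_P.Item \rightarrow R_P.Brand)$

$(store \rightarrow city) \leftrightarrow (R_L.Store \rightarrow R_L.City)$,    $(city \rightarrow area) \leftrightarrow (R_L.City \rightarrow R_L.Area)$

$(day \rightarrow month) \leftrightarrow (R_T.Date \rightarrow R_T.Month)$,    $(month \rightarrow quarter) \leftrightarrow (R_T.Month \rightarrow R_T.Quarter)$,

$(quarter \rightarrow year) \leftrightarrow (R_T.Quarter \rightarrow R_{T'}.Quarter \rightarrow R_{T'}.Year)$

F-tables

Sales $\leftrightarrow R_S$

CostOfItem $\leftrightarrow R_C$

Sales$[(Period : day) \leftrightarrow (R_S.Date \rightarrow R_T),\ (Product : item) \leftrightarrow (R_S.Item \rightarrow R_P),\ (Location : store) \leftrightarrow (R_S.Store \rightarrow R_L)] \rightarrow [(NSales : numeric) \leftrightarrow (R_S.Nsales),\ (Income : numeric) \leftrightarrow (R_S.Revenue)]$

CostOfItem$[(Product : item) \leftrightarrow (R_C.Item \rightarrow R_P),\ (Month : month) \leftrightarrow (R_C.Month \rightarrow R_T.Month)] \rightarrow [(Cost : numeric) \leftrightarrow (R_C.Cost)]$



Fig. 4. Example of mapping embedded in a relational DWI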

A Relational DWI. Let us consider the relational star scheme RelRetail for
a data warehouse, shown in Fig. 3. Note that this scheme is a possible relational
implementation of the MD scheme Retail described in Fig. 1. In the scheme,
$R_S$ stores the daily sales and $R_C$ the monthly costs of the items sold: they play
the role of fact tables. The other tables correspond to dimensional relations:
$R_P$ stores item information, $R_L$ location information, and both $R_T$ and $R_{T'}$ store
temporal information. Note that the scheme is not normalized. Also, we have
indicated the integrity constraints defined on the scheme: they suggest the way
to derive MD objects from the underlying relational database.

The mapping $\theta$ for the RelRetail scheme is sketched in Fig. 4. This mapping
associates: (i) an MD level with an attribute of a dimensional table, (ii) a roll-up
function with one or more relational integrity constraints, (iii) an MD f-table
with a fact table, (iv) coordinates of an MD f-table with attributes and foreign
keys from a fact table to a dimensional table, and (v) measures of an MD f-table
with attributes of a fact table.
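
One possible way to hold such a mapping in memory is as plain data, sketched below in Python. The encoding is an assumption made for illustration: the relation names (`R_T`, `R_L`, `R_S`, and `R_T2` for the second temporal relation), the string format of the constraints, and the helper functions are all hypothetical and are not taken from MDS.

```python
# (i) MD level -> attribute of a dimensional table
LEVELS = {
    "day": "R_T.Date", "month": "R_T.Month", "quarter": "R_T.Quarter",
    "year": "R_T2.Year",
    "store": "R_L.Store", "city": "R_L.City", "area": "R_L.Area",
}
# (ii) roll-up function -> one or more integrity constraints
ROLL_UPS = {
    ("month", "quarter"): ["FD R_T.Month -> R_T.Quarter"],
    ("city", "area"): ["FD R_L.City -> R_L.Area"],
    ("quarter", "year"): ["FK R_T.Quarter -> R_T2.Quarter",
                          "FD R_T2.Quarter -> R_T2.Year"],
}
# (iii)-(v) f-table -> fact table, coordinates, and measures
F_TABLES = {
    "Sales": {
        "relation": "R_S",
        "coordinates": {"Period": "R_S.Date", "Product": "R_S.Item",
                        "Location": "R_S.Store"},
        "measures": {"NSales": "R_S.Nsales", "Income": "R_S.Revenue"},
    },
}


def to_relational(level):
    """MD -> relational direction: attribute that stores a level."""
    return LEVELS[level]


def to_md(attribute):
    """Relational -> MD direction: level represented by an attribute."""
    return next(lv for lv, attr in LEVELS.items() if attr == attribute)


def relations_for(lower, upper):
    """Relations mentioned by the constraints of a roll-up function."""
    tokens = " ".join(ROLL_UPS[(lower, upper)]).split()
    return {t.split(".")[0] for t in tokens if "." in t}
```

The lookup works in both directions, and a roll-up that crosses a foreign key (quarter to year here) reports both relations it needs.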

The mapping can indeed be used in both directions: on the one hand it provides
a basis for the translation of an MD query into a relational query; on the other
hand it indicates how a relational scheme can be expressed in terms of the MD
model. Actually, a Data Translation Algorithm can be easily derived from the
mapping $\theta$ and for this reason it will not be shown.

Let us consider now the QTA algorithm that translates queries over an
MD scheme into SQL queries over a relational data warehouse server. In the
algorithm, reported in Fig. 5, we make use of a four-tuple, which we call SQL
SQL Query Translation Algorithm: SQL-QTA
Input: An expression $E$ of the MD algebra
Output: An SQL query descriptor $\delta$
begin
    $\delta_f := \emptyset$; $\delta_w := \emptyset$; $\delta_g := \emptyset$;
    case $E$ of

        $E$ is an f-table $F$:
            $\delta_s :=$ {the coordinates and measures of $F$, according to the mapping $\theta$};
            $\delta_f :=$ {the relation representing $F$, according to the mapping $\theta$};

        $E$ is a selection $\sigma_C(E')$:
            $\delta' :=$ SQL-QTA($E'$); $\delta_s := \delta'_s$;
            if $\delta'$ is not aggregative then
                $\delta_f := \delta'_f$;
                $\delta_w := \delta'_w \cup$ {the conditions corresponding to $C$, according to the mapping $\theta$}
            else
                $\delta_f := \{\delta'\}$;
                $\delta_w :=$ {the conditions corresponding to $C$, according to the mapping $\theta$}
            endif

        $E$ is a projection $\pi_X(E')$:
            $\delta' :=$ SQL-QTA($E'$);
            $\delta_s :=$ {the coordinates and measures of $X$, according to the mapping $\theta$};
            $\delta_f := \delta'_f$; $\delta_w := \delta'_w$; $\delta_g := \delta'_g$

        $E$ is an equi-join $E' \bowtie_X E''$:
            $\delta' :=$ SQL-QTA($E'$); $\delta'' :=$ SQL-QTA($E''$);
            $\delta_s := \delta'_s \cup \delta''_s$;
            if both $\delta'$ and $\delta''$ are not aggregative then
                $\delta_f := \delta'_f \cup \delta''_f$;
                $\delta_w := \delta'_w \cup \delta''_w \cup$ {the conditions corresponding to $X$, according to the mapping $\theta$}
            else
                $\delta_f := \{\delta'\} \cup \{\delta''\}$;
                $\delta_w :=$ {the conditions corresponding to $X$, according to the mapping $\theta$}
            endif

        $E$ is a scalar function application $\varphi_{M = f(\ldots)}(E')$:
            $\delta' :=$ SQL-QTA($E'$);
            $\delta_s := \delta'_s \cup$ {$f(\ldots)$ AS $M$, according to the mapping $\theta$};
            $\delta_f := \delta'_f$; $\delta_w := \delta'_w$; $\delta_g := \delta'_g$

        $E$ is an aggregate function application $\psi^{N = g(\ldots)}_{X}(E')$:
            $\delta' :=$ SQL-QTA($E'$);
            $\delta_s :=$ {$g(\ldots)$ AS $N$, according to the mapping $\theta$} $\cup$
                   {the attributes and measures of $X$, according to the mapping $\theta$};
            if $\delta'$ is not aggregative then
                $\delta_f := \delta'_f$; $\delta_w := \delta'_w$;
                $\delta_g :=$ {the attributes and measures of $X$, according to the mapping $\theta$}
            else
                $\delta_f := \{\delta'\}$;
                $\delta_g :=$ {the attributes and measures of $X$, according to the mapping $\theta$}
            endif

        $E$ is a roll-up expression $\varrho_{l}^{l'}(E')$:
            $\delta' :=$ SQL-QTA($E'$);
            $\delta_s := \delta'_s \cup$ {the attributes corresponding to $l$ and $l'$, according to the mapping $\theta$};
            if $\delta'$ is not aggregative then
                $\delta_f := \delta'_f \cup$ {the relations corresponding to $\text{R-UP}_{l}^{l'}$, according to the mapping $\theta$};
                $\delta_w := \delta'_w \cup$ {the conditions corresponding to $\text{R-UP}_{l}^{l'}$, according to the mapping $\theta$}
            else
                $\delta_f := \{\delta'\} \cup$ {the relations corresponding to $\text{R-UP}_{l}^{l'}$, according to the mapping $\theta$};
                $\delta_w :=$ {the conditions corresponding to $\text{R-UP}_{l}^{l'}$, according to the mapping $\theta$}
            endif

    endcase;
    return $\delta$;
end



Fig. 5. The query translation algorithm embedded in a relational DWI
query descriptor, to represent the SELECT, FROM, WHERE, and GROUP BY clauses of
an SQL query, respectively. According to SQL syntax, in a query descriptor
$\delta = (\delta_s, \delta_f, \delta_w, \delta_g)$, $\delta_s$ denotes a set of (expressions on) attributes possibly involving
aliases, $\delta_f$ denotes a set of relation names and query descriptors that indicate
nested SQL queries, $\delta_w$ denotes a set of conditions of the form $A_1 = A_2$ (where
$A_1$ and $A_2$ are attributes), and $\delta_g$ is a subset of the attributes occurring in $\delta_s$. A
query descriptor is aggregative if the set $\delta_s$ contains an aggregate function.
According to the structure of MDS, the algorithm takes as input a query
expressed in MD algebra and returns a query descriptor as output, which is
then transformed into an SQL query. The algorithm operates recursively on the
structure of the input expression and transforms elements of the MD algebra
into elements of relational algebra, according to the mapping $\theta$.

The basic step corresponds to the case in which an f-table $F$ is encountered
(first option of the case statement). In this case $\delta_s$ takes the relational attributes
associated with the coordinates and the measures of $F$ and $\delta_f$ takes the fact
table associated with $F$ (see Fig. 4).

Then, at each step, the query under construction is either extended or nested
(in the FROM clause); this depends on whether an aggregation has been
encountered. For instance, when a selection $\sigma_C(E')$ is encountered (second option of
the case statement) and $E'$ does not involve an aggregation, $\delta_w$ is extended with
the condition obtained from $C$ by substituting coordinates and measures with
the relational attributes associated with them, according to $\theta$. Conversely, if $E'$
is aggregative, $\delta_f$ is replaced by the query descriptor obtained by applying the
translation algorithm to $E'$. This corresponds to a nesting in the FROM clause.
The other cases are similar.

A special and interesting case is when a roll-up expression $\varrho_{l}^{l'}(E')$ is
encountered (last option of the case statement). If $E'$ is not aggregative, $\delta_f$ is extended
with the tables involved in the integrity constraints that $\theta$ associates with the
roll-up function $\text{R-UP}_{l}^{l'}$, whereas $\delta_w$ is extended with the join conditions
corresponding to the foreign keys involved (see Fig. 4). If $E'$ is instead aggregative,
the construction proceeds as above.
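
The extend-or-nest behaviour can be sketched in Python for three of the operators (selection, roll-up, aggregation). This is a simplified illustration and not the algorithm of MDS: expressions are nested tuples, descriptors are dictionaries with keys `s`, `f`, `w`, `g`, conditions are passed in already translated to SQL, and the mapping fragment, relation names, and `outer_names` helper are assumptions. In particular, exposing the columns of a nested query by bare name is a shortcut, and a roll-up applied on top of an aggregative sub-expression would need further alias handling that is not shown.

```python
# Toy fragment of the mapping; all relation/attribute names are assumptions.
FTABLES = {"Sales": {"relation": "R_S",
                     "attrs": ["R_S.Date", "R_S.Item", "R_S.Store",
                               "R_S.Nsales", "R_S.Revenue"]}}
ROLLUPS = {("store", "area"): {"attr": "R_L.Area", "relations": ["R_L"],
                               "conditions": ["R_S.Store = R_L.Store"]}}
AGGREGATES = ("SUM(", "COUNT(", "AVG(", "MIN(", "MAX(")


def aggregative(d):
    """A descriptor is aggregative if its SELECT part holds an aggregate."""
    return any(s.upper().startswith(AGGREGATES) for s in d["s"])


def outer_names(d):
    """Shortcut: refer to inner columns by alias or bare attribute name."""
    return [s.split(" AS ")[-1].split(".")[-1] for s in d["s"]]


def sql_qta(e):
    kind = e[0]
    if kind == "ftable":                       # ("ftable", name)
        m = FTABLES[e[1]]
        return {"s": list(m["attrs"]), "f": [m["relation"]], "w": [], "g": []}
    d = sql_qta(e[-1])                         # last component is the sub-expression
    nested = aggregative(d)                    # nest in FROM after an aggregation
    base_s = outer_names(d) if nested else d["s"]
    base_f = [d] if nested else d["f"]
    base_w = [] if nested else d["w"]
    if kind == "select":                       # ("select", condition, e')
        return {"s": base_s, "f": base_f, "w": base_w + [e[1]], "g": []}
    if kind == "rollup":                       # ("rollup", lower, upper, e')
        r = ROLLUPS[(e[1], e[2])]
        return {"s": base_s + [r["attr"]], "f": base_f + r["relations"],
                "w": base_w + r["conditions"], "g": []}
    if kind == "agg":                          # ("agg", expr, alias, group, e')
        return {"s": [f"{e[1]} AS {e[2]}"] + e[3], "f": base_f,
                "w": base_w, "g": list(e[3])}
    raise ValueError(kind)


def render(d):
    sources = [f"({render(x)}) AS sub{i}" if isinstance(x, dict) else x
               for i, x in enumerate(d["f"])]
    sql = f"SELECT {', '.join(d['s'])} FROM {', '.join(sources)}"
    if d["w"]:
        sql += " WHERE " + " AND ".join(d["w"])
    if d["g"]:
        sql += " GROUP BY " + ", ".join(d["g"])
    return sql


inner = ("agg", "SUM(R_S.Revenue)", "Total", ["R_L.Area"],
         ("select", "R_L.Area = 'West'",
          ("rollup", "store", "area", ("ftable", "Sales"))))
flat_sql = render(sql_qta(inner))
nested_sql = render(sql_qta(("select", "Total > 100", inner)))
```

The first expression contains no aggregation below the top operator, so it yields a single flat query; the selection applied on top of the aggregate forces a nesting in the FROM clause.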

The formal nature of the MD model and language allows us to investigate
formal issues about the expressive power of DWIs with respect to the query
capabilities of the underlying OLAP servers.
Abstract. In the context of multidimensional databases implemented on
relational DBMSs through star schemes, the most effective technique to
enhance performance consists of materializing redundant aggregates called
views. In this paper we investigate the problem of vertical fragmentation of
views aimed at minimizing the workload response time. Each view includes
several measures which are not necessarily always requested together; thus,
the system performance may be increased by partitioning the views into 
smaller tables. On the other hand, drill-across queries involve measures 
taken from two or more views; in this case the access costs may be decreased 
by unifying these views into larger tables. After formalizing the 
fragmentation problem as a 0-1 integer linear programming problem, we 
define a cost function and outline a branch-and-bound algorithm to 
minimize it. Finally, we demonstrate the usefulness of our approach by 
presenting a set of experimental results based on the TPC-D benchmark. 



1 Introduction 

Recently, multidimensional databases have gathered wide research and market interest 
as the core of decision support applications such as data warehouses [1][11]. A 
multidimensional database (MD) can be seen as a collection of multidimensional 
“cubes” centered on facts of interest (for instance, the sales in a chain store); within a 
cube, each cell contains information useful for the decision process, i.e., a set of 
numerical measures, while each axis represents a possible dimension for analysis. 

An MD implemented on a relational DBMS is usually organized according to the 
so-called star scheme [13], in which each cube is represented by one fact table storing 
the measures and one denormalized dimension table for each dimension of analysis. 
The primary key of each dimension table (usually a surrogate key, i.e., internally 
generated) is imported into the fact table; the primary key of the fact table is defined 
by the set of these foreign keys. Each dimension table contains a set of attributes 
defining a hierarchy of aggregation levels for the corresponding dimension. 
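
A star scheme of this shape can be tried out with Python's built-in `sqlite3` module. The sketch below is only an illustration: the table and column names and the three fact rows are invented, and two dimensions are enough to show that the fact table's primary key is the set of imported surrogate keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimension tables with surrogate keys.
CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY,
                        store TEXT, city TEXT, area TEXT);
CREATE TABLE date_dim  (date_key INTEGER PRIMARY KEY,
                        day TEXT, month TEXT, year TEXT);
-- Fact table: its primary key is the set of imported foreign keys.
CREATE TABLE sales_fact (
    store_key INTEGER REFERENCES store_dim(store_key),
    date_key  INTEGER REFERENCES date_dim(date_key),
    quantity  INTEGER,
    revenue   REAL,
    PRIMARY KEY (store_key, date_key)
);
INSERT INTO store_dim VALUES (1, 'S1', 'Rome', 'Centre'), (2, 'S2', 'Milan', 'North');
INSERT INTO date_dim  VALUES (1, '2000-01-05', '2000-01', '2000'),
                             (2, '2000-01-20', '2000-01', '2000');
INSERT INTO sales_fact VALUES (1, 1, 3, 60.0), (1, 2, 2, 40.0), (2, 1, 5, 100.0);
""")

# Aggregation along the hierarchies stored in the dimension tables.
rows = conn.execute("""
    SELECT s.city, d.month, SUM(f.revenue)
    FROM sales_fact f
    JOIN store_dim s ON f.store_key = s.store_key
    JOIN date_dim  d ON f.date_key  = d.date_key
    GROUP BY s.city, d.month
    ORDER BY s.city, d.month
""").fetchall()
```

The query rolls daily, per-store facts up to the (city, month) level using only attributes found in the dimension tables.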

The basic mechanism to extract useful information from elemental data in MDs is 
aggregation. In order to improve the system performance for a given workload, an MD 
typically stores, besides the elemental values of measures, also values summarized 
according to some aggregation patterns, i.e., sets of attributes taken from dimension 
tables to define the coarseness of aggregation. Even data summarized according to each
pattern are organized into a star scheme, whose fact table is called a view and imports 
the attributes included in the pattern; measure values are obtained by applying an 
aggregation operator to the data in another fact table with a finer pattern. In the 
following we will use the term view to denote either the fact tables containing 
elemental values (primary views) or those containing aggregated values (secondary 
views). 

Since pre-computing all the possible secondary views is unfeasible, several 
techniques have been proposed to select the subset of views to materialize in order to 
optimize the response to the workload (e.g., [3][12][19]). In this paper we investigate
how the response can be further enhanced by fragmenting views vertically. By vertical 
fragmentation we mean the creation of fragments of views, each including measures 
taken from one or more views with the same pattern as well as the key associated to 
that pattern. Fragmentation may achieve two goals together: partitioning the measures 
of a view into two or more tables, and unifying two or more views into a single 
table. While partitioning may be useful whenever only a subset of the attributes is 
typically required by each query, unification may be convenient when the workload is 
significantly affected by drill-across queries, i.e., queries formulated by joining two or 
more views deriving from different cubes. 

It is remarkable that partitioning entails no significant storage overhead. In fact, on 
the one hand, surrogate keys require a few bytes to be stored. On the other, though on 
primary views the number of dimensions may exceed the number of measures, this is 
less likely on secondary views for two reasons. Firstly, it may be necessary to include 
in them also derived measures and support measures for non-distributive aggregation
operators [9][10]. Secondly, within a coarse aggregation pattern, one or more 
dimensions may be completely aggregated (in this case, the corresponding foreign key 
is dropped from the key of the view). 

As compared to operational databases, in MDs the benefits of fragmentation are 
further enhanced by the multiple query execution plans due to the presence of 
redundant secondary views. These benefits are particularly relevant if the MD is 
implemented on a parallel architecture; if disk arrays are adopted and fragmentation is 
coupled with an allocation algorithm, the queries requiring multiple fragments 
allocated on different disks can be effectively parallelized [15] [16]. 

The problem of determining the optimal partitioning given a workload has been
widely investigated within the context of centralized as well as distributed database
systems, considering non-redundant allocation of fragments (for instance, see
[6][14][16]); unfortunately, the results reported in the literature cannot be applied here
since the redundancy introduced by materializing views binds the partitioning problem
to that of deciding on which view(s) each query should be executed. To the best of our
knowledge, the problem of vertical fragmentation in MDs has been dealt with only in
[15], where no algorithm for determining the optimal fragmentation is proposed. In
[7], views are partitioned vertically in order to build dataindices to enhance
performance in parallel implementations of MDs.

In Section 2 we outline the necessary background for the paper. In Section 3 the
vertical fragmentation problem is formalized, a cost function is defined and a
branch-and-bound approach is outlined. Section 4 presents some results based on the
TPC-D benchmark.
2 Background

2.1 Cubes and Patterns

In this paper we will often need to refer to multidimensional objects from a 
conceptual point of view, apart from their implementation on the logical level. For 
this reason we introduce cubes. 

A multidimensional cube $f$ is characterized by a set of dimensions $Patt(f)$, a set of
measures $Meas(f)$ and a set of attributes $Attr(f) \supseteq Patt(f)$. The attributes in $Attr(f)$ are
related into a directed acyclic graph by a set of functional dependencies $a_i \to a_j$. From
now on, by writing $a_i \to a_j$ we will denote both the case in which $a_i$ directly determines
$a_j$ and that in which $a_i$ transitively determines $a_j$. It is required that



The cube we will use as a working example, LineItem, is inspired by a star scheme
in the TPC-D [18] and describes the composition of the orders issued to a company; it
is characterized by:

$Patt(LineItem) = \{Part, Supplier, Order, ShipDate\}$,

$Meas(LineItem) = \{Price, Qty, ExtPrice, Discount, DiscPrice, SumCharge, Tax\}$

and by the attributes and functional dependencies shown in Fig. 1, where circles 
represent attributes (in gray the dimensions). 



Fig. 1. Functional dependencies in the LineItem cube

On relational DBMSs, cubes are usually implemented adopting the star scheme.
The star scheme for LineItem is:

PART (PartId, Part, Brand, MFGR, Type)

SUPPLIER (SupplierId, Supplier, SNation, SRegion)

ORDER (OrderId, Order, ODate, OMonth, OYear, Customer, CNation, CRegion)

SHIPDATE (ShipDateId, ShipDate, SMonth, SYear)

LINEITEM (PartId, SupplierId, OrderId, ShipDateId, Price, Qty, ExtPrice, Discount,
DiscPrice, SumCharge, Tax)

where LINEITEM is the primary view; the other tables are dimension tables. For the 
sake of simplicity, we will not consider the possibility of normalizing dimension 
tables to obtain snowflake schemes. 

Definition 1. Given a cube $f$, an aggregation pattern (or simply pattern) on $f$ is a
set $P \subseteq Attr(f)$ such that no functional dependency exists between any pair of
attributes in $P$.

With reference to the LineItem cube, examples of patterns are $Patt(LineItem)$,
$\{Part, OMonth, SNation\}$, $\{Brand, Type\}$, $\{\}$.

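Definition 2. Let $P_i$ and $P_j$ be two patterns; we say that $P_i$ is coarser than $P_j$
($P_i > P_j$) if $P_i \neq P_j$ and every attribute in $P_i$ either belongs to $P_j$ or is
functionally determined by an attribute in $P_j$.

For instance, $\{Brand, CRegion\} > \{Brand, Customer, Supplier\}$.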



2.2 The Workload 

In principle, the workload for an MD is dynamic and unpredictable. A possible
approach to cope with this fact, adopted in some commercial tools, consists of 
monitoring the actual workload while the MD is operating. Otherwise, the designer 
may try to determine a core workload a priori: in fact, on the one hand, the user 
typically knows in advance which kind of data analysis (s)he will carry out more often 
for decisional or statistical purposes; on the other, a substantial amount of queries are 
aimed at extracting summary data to fill standard reports. 

As to update queries, we believe they should not be included in the workload. In 
fact, MDs are typically updated only periodically, in an off-line fashion, and during 
this process the database is unavailable for querying. Thus, the update process does 
not directly affect the MD performance, and it is sufficient to ensure that it is properly 
bounded in time. 

Definition 3. The workload is a set of pairs $(q_i, \eta_i)$, where $q_i$ denotes a query and
$\eta_i$ its expected frequency.

Within the scope of this paper, a query $q$ can be characterized by (1) its pattern,
$Patt(q)$; (2) the set of measures it requires, $Meas(q)$; (3) the selectivity, $sel(q)$, defined
as the ratio between the number of tuples returned by $q$ and the cardinality of the view
at $Patt(q)$. For instance, on LineItem, the query asking for the total quantity of each
medium polished part ordered from each American supplier is characterized by
$Patt(q) = \{Supplier, Part\}$, $Meas(q) = \{Qty\}$ and $sel(q) = 0.01$ (assuming that 5 supplier
regions and 20 part types are present, that attribute values are uniformly distributed
and that selection predicates are independent, it is $sel(q) = \frac{1}{5} \cdot \frac{1}{20}$).
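The selectivity estimate of this example can be reproduced with a few lines of Python. This is only a minimal sketch under the assumptions stated in the text (uniformly distributed values, independent predicates); the function name `selectivity` is illustrative and does not come from the paper.

```python
from fractions import Fraction
from math import prod

def selectivity(domain_sizes):
    """Selectivity of a conjunction of independent equality predicates,
    each picking one value out of a uniformly distributed domain."""
    return prod(Fraction(1, n) for n in domain_sizes)

# One supplier region out of 5 and one part type out of 20.
sel_q = selectivity([5, 20])
print(sel_q, float(sel_q))
```

Exact fractions are used so that the product of the two predicate selectivities is not affected by floating-point rounding.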

Part of the queries the user formulates may require comparing measures taken from 
distinct, though related, cubes; in the OLAP terminology, these are called drill-across 
queries. 

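Definition 4. Let $f_1, \ldots, f_m$ be $m$ cubes sharing at least one attribute;
a drill-across query on $f_1, \ldots, f_m$ is a query $q$ whose pattern $Patt(q)$ is a pattern
on each of the $m$ cubes and whose measures satisfy $Meas(q) \subseteq \bigcup_{i} Meas(f_i)$ and
$Meas(q) \cap Meas(f_i) \neq \emptyset$ for $i = 1, \ldots, m$. We call the
projection of $q$ on cube $f_i$ the query $q_i$ characterized by $Patt(q_i) = Patt(q)$ and
$Meas(q_i) = Meas(q) \cap Meas(f_i)$.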

Consider for instance the cube Shipment characterized by:

$Patt(Shipment) = \{Part, ShipTo, ShipFrom, ShipMode, ShipDate\}$,

$Meas(Shipment) = \{QtyShipped, ShippingCost\}$

and by the attributes and functional dependencies shown in Fig. 2. Since pattern
$P = \{Part, Customer, ShipDate\}$ and all the other patterns coarser than $P$ are common to
LineItem and Shipment, a possible drill-across query is the one asking for the total
cost paid by the customers of each region to receive each part, characterized by
$Patt(q) = \{CRegion, Part\}$, $Meas(q) = \{DiscPrice, ShippingCost\}$ and $sel(q) = 1$.



Fig. 2. Functional dependencies in the Shipment cube



2.3 Views 

Given a cube $f$, each pattern on $f$ determines a candidate view for materialization.
Several algorithms have been proposed to determine the optimal set of views to be
materialized, often by significantly reducing the search space [3][12]. Discussing these
algorithms is outside the scope of this paper; we will assume that one of them is
applied to determine, for each cube, an optimal set of views. To the best of our
knowledge, no workload-based materialization algorithm in the literature takes
drill-across queries into account; on the other hand, since these queries play a relevant role
within our workload, it is necessary to involve them in the optimization process.
Thus, when applying the materialization algorithm, every drill-across query is
substituted by its projections on the cubes involved.

Let $V$ be the global set of the (primary and secondary) views to be materialized for
the MD considered, as determined by the materialization algorithm. Given view $v \in V$,
we will denote with $Patt(v)$ and $Meas(v)$, respectively, the pattern on which $v$ is
defined (determined by its primary key) and the set of measures it contains. The
primary view for cube $f$ is characterized by $Patt(v) = Patt(f)$ and $Meas(v) = Meas(f)$; the
secondary views for $f$ by $Patt(v) > Patt(f)$ and $Meas(v) = Meas(f)$.

Given query $q$ and view $v$, $q$ can be answered on $v$ if $Patt(q) \geq Patt(v)$ (that is,
$Patt(q)$ is coarser than or equal to $Patt(v)$) and $Meas(q) \subseteq Meas(v)$. In particular:

If $q$ involves only measures from a single cube $f$, it can always be answered on
the primary view for $f$ or, more conveniently, on any secondary view $v$ for $f$
provided that $Patt(q) \geq Patt(v)$.

If $q$ is a drill-across query on cubes $f_1, \ldots, f_m$, by definition it is $Patt(q) \geq Patt(f_i)$ for
each $i$; thus, $q$ can be solved by first solving all the projections of $q$ on the $m$
cubes, then performing a join between the results (the join attributes are those in
$Patt(q)$).
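The answerability test can be sketched in Python as follows. We read "coarser than or equal to" as: every attribute of the first pattern either belongs to the second pattern or is functionally determined by one of its attributes. The dependency graph `FD` is an assumption reconstructed from the LineItem star scheme (the exact arcs of Fig. 1 are not reproduced here), and all function names are illustrative.

```python
# Assumed direct functional dependencies for LineItem
# (attribute -> attributes it directly determines); hypothetical hierarchy.
FD = {
    "Part": {"Brand", "Type"}, "Brand": {"MFGR"},
    "Supplier": {"SNation"}, "SNation": {"SRegion"},
    "Order": {"ODate", "Customer"}, "ODate": {"OMonth"}, "OMonth": {"OYear"},
    "Customer": {"CNation"}, "CNation": {"CRegion"},
    "ShipDate": {"SMonth"}, "SMonth": {"SYear"},
}

def determined_by(attr):
    """attr plus every attribute it determines, directly or transitively."""
    seen, stack = {attr}, [attr]
    while stack:
        for b in FD.get(stack.pop(), ()):
            if b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def coarser_or_equal(p_i, p_j):
    """True if each attribute of p_i is in p_j or determined by an attribute of p_j."""
    reachable = set().union(*(determined_by(a) for a in p_j))
    return set(p_i) <= reachable

def answerable(q_patt, q_meas, v_patt, v_meas):
    """A query can be answered on a view with a finer (or equal) pattern
    that stores all the measures required."""
    return coarser_or_equal(q_patt, v_patt) and set(q_meas) <= set(v_meas)

MEAS = {"Price", "Qty", "ExtPrice", "Discount", "DiscPrice", "SumCharge", "Tax"}
print(coarser_or_equal({"Brand", "CRegion"}, {"Brand", "Customer", "Supplier"}))
print(answerable({"SNation", "Brand"}, {"Qty"}, {"SNation", "Part", "ODate"}, MEAS))
```

The transitive closure is recomputed on each call for brevity; with many patterns it could be cached per attribute.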
3 Vertical Fragmentation of Views

Each view includes measures which describe the same cube but, within the workload,
may often be requested separately. Thus, the overall system performance may be
increased by partitioning the views determined by the materialization algorithm into
smaller tables, each including only the measures which typically appear together
within the queries. On the other hand, drill-across queries can be solved by joining 
views defined on different cubes. The access costs for these queries may be decreased 
by unifying two or more views on the same pattern into larger tables where all the 
measures required are stored together. 

With the term fragmentation we denote both partitioning and unification of 
(primary or secondary) views. The approach we propose is aimed at determining an 
optimal fragmentation of the views in the set V. 

It is remarkable that the effectiveness of fragmentation for MDs may be higher 
than for operational non-redundant databases; in fact, while in the latter case it is 
known a priori on which table(s) each query will be executed, in MDs the presence of 
redundant views makes multiple solutions possible. In the following we consider an 
example on LineItem. Let $V = \{v_1, v_2\}$, where

$Meas(v_1) = Meas(LineItem)$; $Patt(v_1) = \{SNation, Brand\}$

$Meas(v_2) = Meas(LineItem)$; $Patt(v_2) = \{SNation, Part, ODate\}$

Let the workload include two queries $q_1$ and $q_2$ defined as follows:

$Meas(q_1) = \{Price, Qty, Discount, ExtPrice, DiscPrice\}$; $Patt(q_1) = \{SNation, Brand\}$

$Meas(q_2) = \{Tax, DiscPrice, SumCharge\}$; $Patt(q_2) = \{SNation, Brand\}$

It is convenient to execute both $q_1$ and $q_2$ on $v_1$ since its cardinality is lower than that
of $v_2$ ($Patt(v_1) > Patt(v_2)$). Now, consider a fragmentation including four fragments:

$Meas(v'_1) = \{Price, Qty, Discount, ExtPrice, DiscPrice\}$; $Patt(v'_1) = Patt(v_1)$

$Meas(v''_1) = \{Tax, SumCharge\}$; $Patt(v''_1) = Patt(v_1)$

$Meas(v'_2) = \{DiscPrice\}$; $Patt(v'_2) = Patt(v_2)$

$Meas(v''_2) = \{Price, Qty, Discount, ExtPrice, Tax, SumCharge\}$; $Patt(v''_2) = Patt(v_2)$

This solution is optimal for $q_1$, which will be executed on $v'_1$. As to $q_2$, while Tax and
SumCharge are retrieved from $v''_1$, it might be more convenient to retrieve DiscPrice
from $v'_2$ rather than from $v'_1$, depending on the trade-off between reading fewer measures
and accessing fewer tuples. In general, another factor to be considered in the trade-off is
the number of attributes forming the fact table key: for coarser patterns, the
key is shorter and the size of the tuples read is smaller. The possibility of
answering a query by jointly accessing fragments of different patterns affects the
optimization of the query execution tree by enabling additional push-downs and
pull-ups of group-by operators.



3.1 Problem Statement 

In principle, the fragmentation algorithm should be applied to the whole set of views,
$V$. On the other hand, it may be convenient to unify two measures belonging to two
different cubes $f'$ and $f''$ only if at least two views with the same pattern have been
materialized on $f'$ and $f''$ and the workload includes at least one drill-across query on $f'$
and $f''$ which can be answered on these two views; in this case, we say that $f'$ and $f''$
are strictly related. Two cubes are related if (1) they are strictly related or (2) a third
cube related to both exists. The transitive notion of relatedness induces a partitioning
onto the set of cubes belonging to the MD, which in turn partitions the set of queries
and the set of views according to the cube(s) they are defined on; in order to decrease
complexity, fragmentation is meant to be applied separately to each set of queries on
the corresponding set of related cubes.
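The partitioning of cubes induced by relatedness is the set of connected components of the "strictly related" graph. A minimal sketch using a union-find structure is given below; the function name `related_groups` and the input pairs are illustrative assumptions (in particular, which cubes are strictly related depends on the views materialized and on the workload, and is simply assumed here).

```python
def related_groups(cubes, strictly_related):
    """Partition cubes into sets of (transitively) related cubes.
    strictly_related: iterable of pairs of strictly related cubes."""
    parent = {c: c for c in cubes}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b in strictly_related:
        parent[find(a)] = find(b)

    groups = {}
    for c in cubes:
        groups.setdefault(find(c), set()).add(c)
    return sorted(groups.values(), key=lambda g: sorted(g))

# Hypothetical situation: only LineItem and Shipment are strictly related.
groups = related_groups(["LineItem", "Shipment", "PartSupplier"],
                        [("LineItem", "Shipment")])
print(groups)
```

Fragmentation would then be run once per group, on the queries and views defined on the cubes of that group.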

Let $FS$ be a set of related cubes and $QS$ be the set of queries on the cubes in $FS$.
Let $VS \subseteq V$ be the set of views materialized on the cubes in $FS$ and $PS$ be the set of
patterns characterizing the views in $VS$.

Definition 5. Given cube $f \in FS$, we partition¹ $Meas(f)$ into the largest subsets
of measures which appear all together in at least one query of $QS$ and do not
appear separately in any other query in $QS$.
We call each subset a minterm of $f$ and denote with $MS(f)$ the set of all minterms
of $f$.

For instance, on the LineItem cube, given $QS = \{q_1, q_2\}$ where
$Meas(q_1) = \{Price, Qty, ExtPrice, Discount\}$ and $Meas(q_2) = \{Price, Qty, DiscPrice, SumCharge\}$, it is
$MS(LineItem) = \{\{Price, Qty\}, \{ExtPrice, Discount\}, \{DiscPrice, SumCharge\}\}$.
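One way to compute minterms, sketched below, is to group measures by the exact set of queries in which they appear: two measures fall in the same minterm when no query uses one without the other. This reading of Definition 5 is our interpretation, and the function name `minterms` is illustrative.

```python
from collections import defaultdict

def minterms(measures, query_measures):
    """Group the measures of a cube by the set of queries that use them."""
    by_signature = defaultdict(set)
    for m in measures:
        signature = frozenset(q for q, ms in query_measures.items() if m in ms)
        if signature:  # measures used by no query are left out of the grouping
            by_signature[signature].add(m)
    return {frozenset(group) for group in by_signature.values()}

QS = {
    "q1": {"Price", "Qty", "ExtPrice", "Discount"},
    "q2": {"Price", "Qty", "DiscPrice", "SumCharge"},
}
LINEITEM_MEAS = {"Price", "Qty", "ExtPrice", "Discount", "DiscPrice", "SumCharge", "Tax"}
ms = minterms(LINEITEM_MEAS, QS)
print(sorted(sorted(t) for t in ms))
```

On the example workload this yields the three minterms listed in the text; Tax is skipped because no query of this small workload requires it.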

Definition 6. Given the set of related cubes FS, we define a term as a set of 
measures which either is a minterm of a cube in FS or is the union of the 
minterms, even from different cubes in FS, within a set such that 

We denote with $TS$ the set of terms for $FS$.

For the example above, it is $TS = MS(LineItem) \cup \{Meas(q_1), Meas(q_2)\}$.

Given $FS$, a solution to the fragmentation problem is encoded by a fragmentation
array, i.e., a binary array $C$ with three dimensions corresponding to, respectively, the
queries $q_i \in QS$, the patterns $P_j \in PS$ and the terms $T_k \in TS$. The set of fragments defined
by $C$ is

$VS' = \{ v_{jk} \mid \exists\, q_i \in QS : C_{ijk} = 1 \}$

where fragment $v_{jk}$ is characterized by $Meas(v_{jk}) = T_k$ and $Patt(v_{jk}) = P_j$.

A fragmentation array not only denotes a fragmentation of the views in $VS$; at the
same time, it specifies on which fragment(s) each query is assumed to be executed. In
fact, a 1 in cell $C_{ijk}$ denotes that, when answering query $q_i$, the measures in
$Meas(q_i) \cap T_k$ will be obtained from $v_{jk}$.

The fragmentation encoded by $C$ is feasible if the following constraints are
satisfied:

(1) For each query, every measure required must be obtained from exactly one
fragment (non-ambiguous query execution).

(2) For each pattern, each measure must belong to at most one fragment
(non-redundant fragmentation).

(3) Each fragment in $VS'$ must be a fragmentation of one or more views in $VS$
(consistency with the views materialized).

¹ We assume that each measure appears in at least one query of the workload.



Though constraint (3) states that every fragment must derive from a view in $VS$,
we are not guaranteed that every measure in every view is included in a fragment. In
fact, fragments will be produced only for the minterms, among those deriving from
each view, actually used to answer at least one query. Given a view, we will call lost
minterms those not included in any term generating a fragment. The fragmentation of
primary views must necessarily be lossless; thus, every lost primary minterm must
be reconsidered a posteriori, either by creating a separate fragment or by unifying it 
with one of the fragments determined. On the other hand, as to lost minterms from 
secondary views, it is not obvious whether generating the fragments is convenient or 
not; in fact, the space saved could probably be more profitably employed to store 
indices or additional views. 

In the following we consider a small example on $FS = \{LineItem, Shipment\}$. Let
$QS = \{q_1, q_2, q_3, q_4, q_5\}$; $q_1$, $q_2$, $q_3$ are defined on LineItem, $q_4$ on Shipment and $q_5$ is a
drill-across query:

$Meas(q_1) = \{Price, Qty, Discount\}$; $Patt(q_1) = \{ShipDate\}$

$Meas(q_2) = \{ExtPrice, DiscPrice\}$; $Patt(q_2) = \{Part, Customer\}$

$Meas(q_3) = \{SumCharge, Tax\}$; $Patt(q_3) = \{Part, CNation\}$

$Meas(q_4) = \{QtyShipped, ShippingCost\}$; $Patt(q_4) = \{CNation, MFGR, ShipDate\}$

$Meas(q_5) = \{ExtPrice, DiscPrice, ShippingCost\}$; $Patt(q_5) = \{Brand, CNation\}$

We assume that, besides the primary views $v_1$ and $v_2$, two secondary views $v_3$ and $v_4$
have been materialized on LineItem, one secondary view $v_5$ on Shipment:

$Patt(v_3) = \{Part, Customer\}$; $Patt(v_4) = Patt(v_5) = \{Part, CNation\}$

(for each view, the measures are those of the corresponding cube). Fig. 3 shows the
fragmentation array which represents a feasible solution to this fragmentation
problem, which features five fragments:

$Meas(v'_1) = \{Price, Qty, Discount\}$; $Patt(v'_1) = Patt(LineItem)$

$Meas(v'_2) = \{QtyShipped, ShippingCost\}$; $Patt(v'_2) = Patt(Shipment)$

$Meas(v'_3) = \{ExtPrice, DiscPrice\}$; $Patt(v'_3) = Patt(v_3)$

$Meas(v'_4) = \{SumCharge, Tax\}$; $Patt(v'_4) = Patt(v_4)$

$Meas(v'_5) = \{ExtPrice, DiscPrice, ShippingCost\}$; $Patt(v'_5) = Patt(v_4) = Patt(v_5)$
the first four of which are obtained by partitioning, the last one by coupling
partitioning and unification. The array also denotes that, for instance, query $q_1$ is
executed on $v'_1$.
Fig. 3. Fragmentation array representing a feasible solution
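The example can be checked mechanically. In the sketch below the fragmentation array is stored sparsely, as the set of cells (query, pattern, term) whose value is 1. The assignment of queries to fragments is an assumption inferred from the fragment list above, since the array of Fig. 3 is not reproduced here; the function names are illustrative, and constraint (3) is not checked.

```python
from itertools import combinations

def fragments(C):
    """Fragments defined by C: one per (pattern, term) pair used by some query."""
    return {(p, t) for _, p, t in C}

def is_feasible(C, query_measures):
    """Check constraints (1) and (2) on a set C of cells set to 1."""
    for q, needed in query_measures.items():
        for m in needed:
            # Constraint (1): exactly one fragment supplies measure m to query q.
            if sum(1 for qi, _, t in C if qi == q and m in t) != 1:
                return False
    for (p1, t1), (p2, t2) in combinations(fragments(C), 2):
        # Constraint (2): two fragments on the same pattern share no measure.
        if p1 == p2 and t1 & t2:
            return False
    return True

Q = {
    "q1": {"Price", "Qty", "Discount"},
    "q2": {"ExtPrice", "DiscPrice"},
    "q3": {"SumCharge", "Tax"},
    "q4": {"QtyShipped", "ShippingCost"},
    "q5": {"ExtPrice", "DiscPrice", "ShippingCost"},
}
# Assumed cells set to 1: each query reads the fragment holding exactly its measures.
C = {
    ("q1", "Patt(LineItem)", frozenset(Q["q1"])),
    ("q2", "{Part, Customer}", frozenset(Q["q2"])),
    ("q3", "{Part, CNation}", frozenset(Q["q3"])),
    ("q4", "Patt(Shipment)", frozenset(Q["q4"])),
    ("q5", "{Part, CNation}", frozenset(Q["q5"])),
}
print(len(fragments(C)), is_feasible(C, Q))
```

Moving the fragment for $q_2$ onto pattern {Part, CNation} would violate constraint (2), because ExtPrice and DiscPrice would then be stored in two fragments on the same pattern.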



3.2 The Cost Function 

Among all the feasible solutions to the fragmentation problem, we are interested in 
the one which minimizes the cost for executing the workload. We believe it is 
convenient to keep logical design separate from the physical level in order to both 
provide a more general solution and reduce complexity; thus, the cost function we 
propose intentionally abstracts from any assumptions on the access paths, being based 
on the number of disk pages in which the tuples of interest for a given query are 
stored. In particular, the cost of query $q_i$ within fragmentation $C$ is defined as:

$cost(q_i, C) = \sum_{j} \sum_{k} C_{ijk} \cdot \Phi\left( \frac{ns(P_j)}{\mu_{jk}},\; sel(q_i) \cdot ns(P_j) \right)$

where:

$ns(P_j)$ is the cardinality for the view on pattern $P_j$ (estimated for instance as
shown in [8]).

$sel(q_i)$ is the selectivity of $q_i$; thus, $sel(q_i) \cdot ns(P_j)$ is the number of tuples of the
view on pattern $P_j$ which must be accessed in order to answer $q_i$.

$\mu_{jk}$ is the number of tuples per disk page for fragment $v_{jk}$, characterized by $P_j$ and
$T_k$. Thus, $ns(P_j) / \mu_{jk}$ is the number of pages in which $v_{jk}$ is contained.

$\Phi\left( ns(P_j)/\mu_{jk},\; sel(q_i) \cdot ns(P_j) \right)$ is the expected number of pages in which the tuples
necessary for $q_i$ are stored, estimated with the Cardenas formula [5],
$\Phi(n, t) = n \left( 1 - (1 - 1/n)^{t} \right)$.

Thus, $cost(q_i, C)$ expresses the total number of disk pages which must be accessed in
order to solve $q_i$. Though the actual number of pages read when executing the query
may be higher depending on the access path followed, we believe that this function
represents a good trade-off between generality and accuracy.

It should be noted that, whenever two views are unified, the resulting fragment
may be used to answer not only drill-across queries, but also queries on the single
cubes; thus, it must contain the union of their tuples. Let $f'$ and $f''$ be two cubes, let
$P$ be a pattern common to $f'$ and $f''$, and let $ns'(P)$ and $ns''(P)$ be the cardinalities of
views $v'$ on $f'$ and $v''$ on $f''$, respectively. The cardinality of the view unifying $v'$ and
$v''$ can be estimated as:

$ns(P) = ns'(P) + ns''(P) - \frac{ns'(P) \cdot ns''(P)}{cs(P)}$

where $cs(P)$ is the product of the domain cardinalities for the attributes in $P$.
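The quantities of this section can be sketched numerically as follows. The page estimate uses the Cardenas formula in its standard form, $n(1 - (1 - 1/n)^t)$ for $t$ tuples picked from $n$ pages; the cost of a query is taken as the sum of this estimate over the fragments it reads, and the cardinality of a unified view is estimated assuming that the two views populate the $cs(P)$ possible key combinations independently. All function names and the numbers in the usage lines are illustrative assumptions.

```python
def cardenas(pages, tuples):
    """Expected number of distinct pages touched when `tuples` tuples are
    picked at random from a table stored on `pages` pages."""
    pages = max(pages, 1.0)  # a non-empty fragment occupies at least one page
    return pages * (1.0 - (1.0 - 1.0 / pages) ** tuples)

def query_cost(sel, used_fragments):
    """Pages read by a query with selectivity `sel`.
    used_fragments: (ns, mu) pairs, i.e. view cardinality and tuples per page,
    one for each fragment the query is executed on."""
    return sum(cardenas(ns / mu, sel * ns) for ns, mu in used_fragments)

def unified_cardinality(ns1, ns2, cs):
    """Estimated size of the union of two views on the same pattern,
    assuming independent occupancy of the cs possible key combinations."""
    return ns1 + ns2 - ns1 * ns2 / cs

# Hypothetical figures: the same view stored as a wide table (50 tuples per page)
# or as a narrow fragment (200 tuples per page).
wide = query_cost(0.01, [(100_000, 50)])
narrow = query_cost(0.01, [(100_000, 200)])
print(round(wide), round(narrow))
```

With the same selectivity, the narrow fragment spreads the required tuples over fewer pages, which is the effect that partitioning exploits.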



3.3 A Branch-and-Bound Approach 

The problem of vertical fragmentation (VFP) can be formulated as follows: Find, for
the binary decision array $C$, the value which minimizes function

$\sum_{q_i \in QS} \eta_i \cdot cost(q_i, C)$

(where $\eta_i$ is the expected frequency of $q_i$) subject to constraints (1), (2), (3)
expressed in Section 3.1. VFP is a 0-1 integer
linear programming problem like set covering with additional constraints, and is
known to be NP-hard [17]. Thus, a branch-and-bound approach can be adopted to solve
it optimally.

The ingredients of a branch-and-bound procedure for a discrete optimization
problem such as VFP are [1]:

(i) A branching rule for breaking up the problem into subproblems. Let $VFP_0$ be
the problem of choosing, given a partial solution to VFP represented by an
“incomplete” array² $C(VFP_0)$, the remaining elements to be set to 1 in the
complete solution. We denote with $SUB(VFP_0)$ the set of subproblems in
which $VFP_0$ is broken up; each is defined by choosing one element $C_{ijk}$ to be
set to 1 in the partial solution, which means adding to the current solution a
fragment on pattern $P_j$ to be used for retrieving some measures $T_k$ to solve query
$q_i$.

(ii) A subproblem selection rule for choosing the next (most promising)
subproblem to be processed. The element chosen is the one for which $ns(P_j)$
is minimum and $Meas(q_i) \cap T_k$ has maximum cardinality.

(iii) A relaxation of $VFP_0$, i.e., an easier problem $VFR_0$ whose solution bounds that
of $VFP_0$. We relax $VFP_0$ by removing constraint (2): in $VFR_0$, some measures
may be replicated in two or more fragments defined on the same pattern.

(iv) A lower bounding procedure to calculate the cost of the relaxation. $VFR_0$
consists of one set covering problem for each $q_i$, which can be solved by
adopting one of the algorithms in the literature [4]. Since in solving $VFR_0$ the
number of eligible fragments is higher than that for $VFP_0$, the cost of $VFR_0$
will be lower than or equal to that of $VFP_0$.

² Non-feasible since one or more queries cannot be answered (constraint (1) is not
satisfied).
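The interplay between branching, relaxation and bounding can be illustrated on a toy instance. The sketch below is a simplified variant and not the procedure described above: it branches on one query at a time (choosing a whole set of fragments for it) rather than on a single cell, it uses a brute-force enumeration in place of a set covering algorithm, and it ignores constraint (3) and query frequencies. The lower bound is obtained, as in point (iii), by dropping constraint (2). The cost function and all names are hypothetical.

```python
import math

def exact_covers(meas_q, remaining, cands):
    """Yield tuples of (pattern, term) whose parts term & meas_q partition meas_q."""
    if not remaining:
        yield ()
        return
    m = min(remaining)  # always extend the cover on one fixed uncovered measure
    for pattern, term in cands:
        got = term & meas_q
        if m in got and got <= remaining:
            for rest in exact_covers(meas_q, remaining - got, cands):
                yield ((pattern, term),) + rest

def branch_and_bound(queries, cands, cost):
    """queries: list of frozensets of measures; cands: list of (pattern, term);
    cost(i, pattern, term): pages read by query i on that fragment."""
    n = len(queries)
    covers = [list(exact_covers(q, q, cands)) for q in queries]
    price = lambda i, cover: sum(cost(i, p, t) for p, t in cover)
    # Relaxation: without constraint (2) each query takes its cheapest cover.
    relaxed = [min((price(i, c) for c in covers[i]), default=math.inf)
               for i in range(n)]
    bound_from = [sum(relaxed[i:]) for i in range(n + 1)]
    best = {"cost": math.inf, "plan": None}

    def compatible(frags, pattern, term):
        # Constraint (2): fragments on the same pattern share no measure.
        return all(t == term or not (t & term) for p, t in frags if p == pattern)

    def branch(i, frags, plan, spent):
        if spent + bound_from[i] >= best["cost"]:
            return  # prune: even the relaxed completion cannot beat the incumbent
        if i == n:
            best["cost"], best["plan"] = spent, plan
            return
        for cover in sorted(covers[i], key=lambda c: price(i, c)):
            new = set(frags)
            for p, t in cover:
                if not compatible(new, p, t):
                    break
                new.add((p, t))
            else:
                branch(i + 1, new, plan + [cover], spent + price(i, cover))

    branch(0, set(), [], 0)
    return best["cost"], best["plan"]

# Toy instance on a single pattern "P1" with three minterms A, B, Cm.
A = frozenset({"Price", "Qty", "Discount", "ExtPrice"})
B = frozenset({"DiscPrice"})
Cm = frozenset({"Tax", "SumCharge"})
cands = [("P1", t) for t in (A, B, Cm, A | B, B | Cm)]
queries = [A | B, B | Cm]
toy_cost = lambda i, p, t: 10 * (1 + len(t))  # hypothetical: width = key + measures
best_cost, plan = branch_and_bound(queries, cands, toy_cost)
print(best_cost, plan)
```

On this instance the relaxed bound is 100 (each query reading one fragment with exactly its measures), which is infeasible because the two fragments would both store DiscPrice on the same pattern; the search settles on the three fragments A, B and Cm.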



4 Experimental Tests 

In this paper we have proposed an approach to vertical fragmentation of views in 
multidimensional databases. The experimental results we present in this section 
confirm the utility of the approach in terms of reduction of the cost for executing the 
expected workload. The tests we have carried out are based on the well-known TPC-D 
benchmark [18], which features two cubes LineItem and PartSupplier with
cardinalities 6,000,000 and 800,000, respectively; the total amount of data is about 1
Gbyte. 

We have tested our approach on the Informix DBMS with a workload based on the 
17 TPC-D queries (all with the same frequency). The views to be fragmented have 
been selected by means of the heuristic approach to materialization proposed in [3], by 
considering a global space constraint of 2 GB (1 GB for primary views + 1 GB for 
secondary views); as a result, 11 secondary views were created besides the 2 primary
views. The fragmentation algorithm determined 14 fragments (3 from the primary 
views) and 9 lost minterms. Indices on all the attributes belonging to keys in both 
fact and dimension tables were created. 

Fig. 4 shows, for each query, the ratio between the number of disk pages read 
without and with fragmentation; above each column, the number of disk pages read 
without fragmentation. Overall, fragmentation decreases the workload cost from 
265904 to 59986 pages (more than 4 times). 

Fig. 5 shows how fragmentation affects the total storage space; above each 
column, the storage space without fragmentation. Overall, the unfragmented views 
require 368840 pages; while materializing only the fragments (no lost minterms) 
decreases the space required to 306042 pages (-17.0%), materializing also lost 
minterms increases the space required to 442097 pages (+19.8%).

It should be noted that the next view to be materialized beyond the 2 GB constraint 
would take 126460 disk pages, and decrease the workload cost by 1%; fragmentation 
is more convenient since it takes only 73257 extra pages and decreases the cost by 
77%. In fact, while materializing one more view typically benefits few queries, 
several queries may take advantage of using the same disk space for fragmentation.
Furthermore, while fragmentation does not require extra space for dimension tables, 
each new view may require adding new tuples in dimension tables to be referenced by 
the aggregated tuples in the view. 
Fig. 4. Ratio between the number of disk pages read without and with fragmentation

Fig. 5. Ratio between the storage space with and without fragmentation

In order to evaluate the algorithm complexity, we have defined four more
workloads, each progressively extending the TPC-D; the results are shown in Table I.
The computing time does not depend strictly on the workload size; it is also
determined by the relationships between the queries.
Table I. Results of the complexity tests

n. queries in the workload    n. subproblems generated    computing time
17                            2775                        about 1 min
25                            4439                        about 2 mins
30                            348925                      about 30 mins
35                            51099                       about 12 mins
40                            403420                      about 75 mins
Abstract. Data cubes provide aggregate information to support the
analysis of the contents of data warehouses and databases. An important
tool to analyze data in data cubes is the range query. For range queries
that summarize large regions of massive data cubes, computing the query
result on-the-fly can result in non-interactive response times. To speed
up range queries, values that summarize regions of the data cube are
pre-computed and stored. This faster response time results in more expensive
updates and/or space overhead. While the emphasis is typically on low
query and update costs, growing data collections increase the demand
for space-efficient approaches. In this paper two techniques are presented
that have the same update and query costs as earlier approaches, without
introducing any space overhead.



1 Introduction 

Data cubes are powerful tools to support the analysis of the contents of data
warehouses and databases. A data cube is similar to a multidimensional array.
Certain attributes of the database are chosen to be measure attributes. These
are the attributes whose values are of interest to an analyst. Other attributes are
selected as dimensions (also called functional attributes). The measure attributes
are aggregated according to the dimensions. A cell of the data cube is described
by a unique combination of dimension values. An example of a data cube based
on the TPC-H benchmark database [9] would have the total price of an order
as the measure attribute and the region of a customer and the order date as
the dimensions. Such a data cube provides the aggregated total orders for all
combinations of regions and dates. Queries issued by an analyst who wants to
examine how the customer behavior in different regions changes over time (e.g.,
in order to evaluate the success of local advertising campaigns) do not need to
access and join the “raw” data in the different tables. Instead the information is
available and summarized from the data cube. Note that our data cube notion
differs from the terminology used in [5]. We do not augment the data cube
with pre-computed results of GROUP-BYs of subsets of the set of all dimension
attributes. Thus our data cube notion corresponds to the data cube core in [5].

* This work was partially supported by NSF grants EIA-9818320, IIS-98-17432, and
IIS-99-70700.

Y. Kambayashi, M. Mohania, and A M. Tjoa (Eds.): DaWaK 2000, LNCS 1874, pp. 24-33, 2000.

© Springer-Verlag Berlin Heidelberg 2000

Aggregate range queries are useful analysis tools on data cubes. Such a range
query aggregates the values of those cells that satisfy the range selection
condition for all dimensions. For instance, a range query on our example data cube
could “Find the total amount of orders in California over the last four months”.
Queries of this form are useful in discovering relationships between attributes in
the database.

Analyzing data online is a highly interactive process. Analysts expect fast
responses to their queries, ideally in the order of seconds at most. For massive
data sets, however, range queries that access and aggregate on-the-fly the
contents of a large number of cells will show slow response times. To speed up those
queries, the aggregates for sets of cells are pre-computed and stored in the data
cube. This leads to well-known tradeoffs. Storing additional pre-computed values
results in space overhead. Also, updates become more expensive when an
update to a single cell triggers updates to all pre-computed values that include this
cell in their aggregation. Different applications tolerate different update costs.
While what-if scenarios and stock trading applications require fast updates, for
other applications overnight batch processing of updates suffices. But even batch
processing benefits from faster updates, since they reduce the size of the update
window and allow for more frequent updates and shorter inaccessibility of the
data. Ideally a data cube should support fast queries and fast updates at no
extra storage cost.

An elegant algorithm for computing range queries that return the sum of the 
selected cells in data cubes is presented in [6]. We refer to it as the Prefix Sum 
technique (PS). The essential idea is to pre-compute the prefix sums of the data 
cube (see Fig. 2), which are used to answer ad hoc queries in constant time. Since 
the prefix sums replace the original values in the cells, the PS technique does 
not require additional space. The approach is mainly hampered by its update 
costs. In the worst case an update to a single cell requires recomputing the whole 
array, which is of the same size as the original data cube. 
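The idea behind the prefix sum technique can be sketched for a two-dimensional cube as follows. This is only an illustration of the general principle (prefix sums plus inclusion-exclusion), not the implementation of [6]; the function names and the sample cube are our own.

```python
def prefix_sums(cube):
    """P[i][j] = sum of all cells cube[0..i][0..j]."""
    rows, cols = len(cube), len(cube[0])
    P = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            P[i][j] = (cube[i][j]
                       + (P[i - 1][j] if i else 0)
                       + (P[i][j - 1] if j else 0)
                       - (P[i - 1][j - 1] if i and j else 0))
    return P

def range_sum(P, r1, c1, r2, c2):
    """Sum over rows r1..r2 and columns c1..c2 (inclusive), using four lookups."""
    def at(i, j):
        return P[i][j] if i >= 0 and j >= 0 else 0
    return at(r2, c2) - at(r1 - 1, c2) - at(r2, c1 - 1) + at(r1 - 1, c1 - 1)

cube = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
P = prefix_sums(cube)
print(range_sum(P, 1, 1, 2, 2))  # 5 + 6 + 8 + 9

# Worst-case update: changing cell (0, 0) affects every prefix sum.
cube[0][0] += 10
changed = sum(a != b for ra, rb in zip(P, prefix_sums(cube)) for a, b in zip(ra, rb))
print(changed)
```

The query touches a constant number of cells regardless of the size of the range, while the update of the first cell forces all entries of the prefix-sum array to be recomputed.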

To reduce the high update costs, while still guaranteeing a constant query 
cost, the Relative Prefix Sum technique (RPS) [3] controls the cascading updates. 
This comes at the cost of a space overhead. In contrast, the Hierarchical Cubes 
techniques (HC) [1] do not require additional space. HC generalize the idea of 
RPS by allowing different tradeoffs between update and query cost. The tradeoff 
is selected by setting parameters that control the generation of the pre-computed 
values. Consequently the query and update costs depend on those parameters as 
well as the dimensionality and the size of the data cube. This makes a general 
comparison of HC to the other techniques difficult. For instance, while for some 
data cubes one of the HC techniques might provide a parameter setting that 
leads to a better query and update behavior than RPS, for other data cubes this 
is not the case.
The only technique that guarantees that query and update cost are both 
sublinear in the domain size of the dimensions for any data cube is the Dy- 
namic Data Cube (DDC) [2]. The space overhead of this technique, however, is 
significant. 

For massive data sets the space requirements of a technique become a decisive 
factor. Space overhead not only leads to extra costs for storage devices, but also 
causes additional propagations of updates and longer access times on the physical 
devices. 

In this paper we present two new space-efficient data cube techniques - SRPS 
and SDDC - based on RPS and DDC, respectively. Both techniques inherit the 
update and query costs of their predecessors, but considerably reduce the space 
requirements. More precisely, they have the same storage requirement as the 
original data cube, thus do not introduce any space overhead. They are suitable 
for data warehousing environments, especially decision support and OLAP appli- 
cations. SRPS efficiently supports applications where queries dominate. SDDC 
balances the costs of queries and updates. Thus it is especially appropriate in 
settings with frequent updates and enables users to analyze what-if scenarios. 

In Sect. 2 we describe the SRPS and SDDC techniques. Both techniques are 
compared to their predecessors RPS and DDC, respectively. Section 3 concludes
this article.

2 The SRPS and SDDC Techniques 

In this section SRPS and SDDC are presented and compared to RPS and DDC, 
respectively. The following notation will be used. Let $A$ be a data cube of
dimensionality $d$, and let $c = [c_1, \ldots, c_d]$ be a cell that contains the
value $A[c]$. Without loss of generality let the domain of each dimension
attribute $i$ be $\{0, 1, \ldots, n-1\}$. $e : f$ is a region of the data cube,
more precisely the set of all cells $c$ that satisfy $e_i \le c_i \le f_i$ for
all $1 \le i \le d$ (i.e., $e : f$ is a hyper-rectangular region of the data
cube). Cell $e$ is the anchor and cell $f$ the endpoint of the region.
Consequently the entire data cube is anchored at $[0, \ldots, 0]$ and ends at
$[n-1, \ldots, n-1]$. The set of the values in region $e : f$ is denoted
$A[e] : A[f]$, and $op(A[e] : A[f])$ is the result of applying the aggregate
operator $op$ to those values.

SRPS and SDDC make use of the inverse property of some aggregation
operators. They can be applied to any operator $\oplus$ for which there exists an
inverse operator $\ominus$ such that $(a \oplus b) \ominus b = a$ (e.g., SUM, COUNT). For the
SQL operator SUM (sum of the values of the selected cells) each region’s sum can
be obtained by adding and subtracting sums for appropriate regions that are
anchored at $[0, \ldots, 0]$. We will refer to a region that is anchored at
$[0, \ldots, 0]$ as a prefix region; a query that selects such a region is a
prefix query. Note that according to [6] any range sum can be computed by
combining the range sums of up to $2^d$ (which is a constant) prefix regions.
Thus the problem of computing the sum for an arbitrary range is reduced to the
problem of efficiently computing prefix sum queries. We will therefore only
describe how SRPS and SDDC solve this problem.
The SRPS and SDDC techniques are described for the aggregate operator 
SUM. Other operators for which there exists an inverse operator can be handled 
similarly. In our analysis query and update costs are expressed in terms of the 
number of accessed cells of the data cube. The storage cost is measured in terms 
of cells as well. 
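
The reduction from arbitrary range sums to prefix sums can be sketched with inclusion-exclusion over the corners of the query region. The Python code below is only an illustration: `range_sum_from_prefix` and `brute_force_prefix` are hypothetical names, the sample cube is assumed data, and the brute-force prefix function merely stands in for a real structure such as PS, SRPS, or SDDC.

```python
from itertools import product


def range_sum_from_prefix(prefix_sum, e, f):
    """Sum over region e:f using at most 2**d calls to prefix_sum(q),
    where prefix_sum(q) returns the sum over [0,...,0] : q."""
    d = len(e)
    total = 0
    for choice in product((0, 1), repeat=d):
        # choice[i] == 1 selects the corner coordinate e[i] - 1 in dimension i,
        # choice[i] == 0 selects f[i].
        corner = tuple(e[i] - 1 if choice[i] else f[i] for i in range(d))
        if min(corner) < 0:
            continue  # empty prefix region, contributes 0
        sign = -1 if sum(choice) % 2 else 1
        total += sign * prefix_sum(corner)
    return total


def brute_force_prefix(cube):
    """Hypothetical stand-in for a prefix-sum structure (slow, for checking only)."""
    def cell(c):
        value = cube
        for x in c:
            value = value[x]
        return value

    def prefix_sum(q):
        return sum(cell(c) for c in product(*(range(x + 1) for x in q)))

    return prefix_sum


# Assumed sample data: a 3 x 3 x 3 cube with cell value 9*i + 3*j + k.
cube = [[[9 * i + 3 * j + k for k in range(3)] for j in range(3)] for i in range(3)]
result = range_sum_from_prefix(brute_force_prefix(cube), (1, 0, 1), (2, 1, 2))
```

Each of the $2^d$ sign combinations contributes one prefix query, so the number of prefix queries depends only on $d$ and not on $n$.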



2.1 RPS: The Relative Prefix Sum Technique 

In this section we give an overview of the RPS technique [3]. A detailed correct
analysis for RPS for high-dimensional data cubes can be found in [4]. Like the 
Prefix Sum technique, RPS reduces the problem of summarizing any possible 
range to the problem of summarizing and combining prefix regions. The main 
idea of RPS is to avoid the cascading updates of the Prefix Sum technique by 
dividing the data cube into smaller chunks of equal size, called overlay boxes. 
The prefix sums are computed and stored relative to the anchor cell of an overlay 
box. The array with those relative prefix sums has the same size as the original 
data cube. Since the relative prefix sums only provide aggregate information 
about the cells inside the overlay box, an additional data structure - the overlay 
array - is used. The overlay array provides sums for regions of cells outside the 
overlay boxes. Together the overlay and the relative prefix sum array guarantee a
worst case cost of $2^d$ for prefix queries, a worst case cost of $2^{2d}$ for general range
sum queries, and a worst case update cost of $(2\sqrt{n} - 2)^d$. Compared to directly
storing the original data cube, the RPS technique incurs a space overhead of the 
size of the overlay array. Depending on the parameters (dimensionality, size of 
the data cube and the overlay boxes) this overhead ranges from a few percent 
up to almost 100% of the data cube size in some settings. 



2.2 SRPS: The Space-Efficient Relative Prefix Sum Technique 

Like the Relative Prefix Sum method, SRPS provides constant-time queries with 
an update complexity of $O(n^{d/2})$, compared to an update cost of $O(n^d)$ for the
Prefix Sum technique. SRPS improves on RPS by not incurring the additional 
overlay array and thus removing the space overhead. 



Description of the Technique. The data cube is completely partitioned into 
a set of disjoint hyper-rectangles of equal size. We will refer to those hyper- 
rectangles as boxes. For clarity and without loss of generality let the length of a 
box in each dimension be k. 

Let $B$ be a box that is anchored at cell $a = [a_1, \ldots, a_d]$. Then box $B$ contains
all cells $c$ that satisfy $a_i \le c_i < a_i + k$ for all $1 \le i \le d$. The box cells on the
“upper left” surfaces, i.e., all those cells that agree with the anchor cell $a$ in at
least one coordinate, are referred to as border cells. The other cells, i.e., those
cells $c$ with $a_i + 1 \le c_i < a_i + k$ for all $1 \le i \le d$, are inner cells. Essentially
inner cells only store sums local to the box, while border cells include cells from
outside the box in their aggregation.

Fig. 1. Computation of border values as the sum of the values of the cells in the shaded
area on array A

Fig. 2. Prefix Sum compared to SRPS



Any cell $c$ in box $B$ stores the value SUM($A[l_1, l_2, \ldots, l_d] : A[c]$) where

$$l_i = \begin{cases} 0 & \text{if } c_i = a_i, \\ a_i + 1 & \text{if } a_i + 1 \le c_i < a_i + k. \end{cases}$$

Note that border cells satisfy $c_i = a_i$ for at least one dimension $i$, while for
inner cells the second inequality ($a_i + 1 \le c_i < a_i + k$) holds for all dimensions.
We will use the term aggregation region for the described regions of cells. Border
cells aggregate hyper-rectangular regions of cells that stretch from the surface
of the box to the corresponding surface of the data cube. Inner cells $c$ store
SUM($A[a_1 + 1, a_2 + 1, \ldots, a_d + 1] : A[c]$), which is the prefix sum relative to cell
$[a_1 + 1, a_2 + 1, \ldots, a_d + 1]$. Figure 1 shows aggregation regions (shaded) for border
cells of a two-dimensional data cube. In the example in Fig. 2 the original data
cube and the corresponding SRPS cube are shown.
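
To make the definition of the aggregation regions concrete, the following Python sketch constructs an SRPS cube for a two-dimensional array directly from the definition. It is an unoptimized illustration; the names `region_start` and `build_srps` and the 4 x 4 sample cube are assumptions of this sketch and do not reproduce the data of Fig. 2.

```python
def region_start(x, k):
    """Start coordinate of the aggregation region in one dimension:
    0 if x equals the anchor coordinate of its box, otherwise anchor + 1."""
    anchor = (x // k) * k
    return 0 if x == anchor else anchor + 1


def build_srps(cube, k):
    """Return the SRPS cube of a square two-dimensional cube with box length k."""
    n = len(cube)
    assert n % k == 0, "sketch assumes that k evenly divides n"
    srps = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            r0, c0 = region_start(r, k), region_start(c, k)
            # Aggregate the region [r0, c0] : [r, c] of the original cube.
            srps[r][c] = sum(cube[i][j]
                             for i in range(r0, r + 1)
                             for j in range(c0, c + 1))
    return srps


# Assumed sample data: values 1..16 in row-major order, boxes of side-length 2.
cube = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
srps = build_srps(cube, 2)
```

In this example cell [2, 2] is a box anchor and stores the full prefix sum 54, while the inner cell [3, 3] stores only its own value 16.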

Fig. 3. Querying and updating an SRPS cube: (a) partitioning the query region;
(b) query: 99+18+52+2=171; (c) update to [5,1]

SRPS by definition does not cause any space overhead. All pre-computed
values “fit” into an array of the size of the data cube. Once the SRPS cube is
constructed, the original data cube can be discarded. All queries and updates are
directed to the SRPS cube. The space savings compared to the RPS technique
can be considerable. For RPS a prefix array of size $n^d$ (i.e., size of the original
data cube) and an overlay array of size $(n/k)^d (k^d - (k-1)^d) = n^d (1 - ((k-1)/k)^d)$
were stored. Thus SRPS saves storage of the size $n^d (1 - ((k-1)/k)^d)$.
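
The size of the saved overlay array can be checked with a few lines of arithmetic. The Python sketch below evaluates the overlay size for assumed parameter values; the function names are hypothetical.

```python
def overlay_cells(n, k, d):
    """Overlay array size of RPS: (n/k)^d * (k^d - (k-1)^d) cells."""
    boxes = (n // k) ** d
    return boxes * (k ** d - (k - 1) ** d)


def overhead_fraction(k, d):
    """Overlay size relative to the data cube size n^d: 1 - ((k-1)/k)^d."""
    return 1 - ((k - 1) / k) ** d


# Assumed parameters: n = 100, k = 10, d = 2 gives 1900 of 10000 cells (19%).
small = overlay_cells(100, 10, 2)
# Small boxes in higher dimensions push the overhead towards 100%.
large = overhead_fraction(2, 5)
```

With $k = 10$ and $d = 2$ the overlay array adds 19% to the cube size, whereas $k = 2$ and $d = 5$ already yields about 97%.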

Querying SRPS. As mentioned earlier, any range sum query can be answered
by combining the results of up to $2^d$ appropriate prefix queries. Let
$q = [q_1, \ldots, q_d]$ be the endpoint of the prefix region and $a = [a_1, \ldots, a_d]$ be the
anchor of box $B$ that contains $q$. Then the query region $[0, \ldots, 0] : q$ can be
partitioned into non-overlapping regions which are identical to the aggregation
regions of the cells in the set $\{c = [c_1, \ldots, c_d] \mid \forall i : 1 \le i \le d \wedge c_i \in \{a_i, q_i\}\}$. This
set contains at most $2^d$ cells. Intuitively they are obtained as the “projection” of
cell $q$ to the surfaces of box $B$ that contain the border cells (including cell $q$
itself). Details about the partitioning are provided in [8]. Note that the result for
a prefix query can be obtained by adding values from a single box, which results
in a high locality of accesses. Since a prefix query can be answered at a cost of $2^d$,
the overall worst case range query cost for SRPS becomes $2^d \cdot 2^d = 2^{2d}$. Hence
the query cost is constant irrespective of $n$, the size of the dimension domains
of the data cube. Figure 3(a) shows an example for the partitioning of the query
region for a two-dimensional data cube. The shaded cells in Fig. 3(b) need to be
accessed in order to compute the prefix range sum.
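
A prefix query on a two-dimensional SRPS cube can be sketched as follows. The code is a self-contained illustration under the assumption that $k$ divides $n$; the helper names and the sample cube are hypothetical and do not correspond to the values shown in Fig. 3.

```python
from itertools import product


def region_start(x, k):
    """0 if x is the anchor coordinate of its box, otherwise anchor + 1."""
    anchor = (x // k) * k
    return 0 if x == anchor else anchor + 1


def build_srps(cube, k):
    """Brute-force construction of a two-dimensional SRPS cube."""
    n = len(cube)
    return [[sum(cube[i][j]
                 for i in range(region_start(r, k), r + 1)
                 for j in range(region_start(c, k), c + 1))
             for c in range(n)] for r in range(n)]


def srps_prefix_query(srps, k, q):
    """Sum over [0, 0] : q, read from at most 2**2 cells of a single box."""
    anchor = tuple((x // k) * k for x in q)
    # Per dimension use the anchor coordinate and the query coordinate; the set
    # drops the duplicate when q lies on the border in that dimension.
    choices = [sorted({a, x}) for a, x in zip(anchor, q)]
    cells = list(product(*choices))
    return sum(srps[r][c] for r, c in cells), cells


# Assumed sample data: values 1..16 in row-major order, boxes of side-length 2.
cube = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
srps = build_srps(cube, 2)
total, cells = srps_prefix_query(srps, 2, (3, 3))
```

For the endpoint [3, 3] the query adds the four values stored at [2, 2], [2, 3], [3, 2], and [3, 3], all of which belong to the same box.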



Updating SRPS. In general an update to a single cell affects all those cells 
that store a pre-computed value that depends on that cell. Figure 3(c) shows an 
example. An update to cell (5, 1) (marked with *) has to be propagated to each 
of the shaded cells. 

To keep the description simple, we assume that $k$, the side-length of each box,
evenly divides $n$, the side-length of the data cube. Clearly the number of cells
that are affected by an update to a cell $u$ is equal to the number of aggregation
regions that contain $u$. From the definition of the aggregation regions it follows
that at most $(n/k + k - 2)^d$ aggregation regions contain cell $u$. This bound is
tight; it is met when cell $u = [1, 1, \ldots, 1]$ is updated. Note that the aggregation
regions that contain the updated cell are well defined. The cells that need to
be updated are the endpoints of those regions. Details of the analysis are not
provided here due to space limitations and can be found in [8].

The update costs are minimal for $k = \sqrt{n}$, resulting in a worst case update
cost of $(2\sqrt{n} - 2)^d = O(n^{d/2})$. Changing $k$ does not affect the worst case query
costs.¹ Consequently, choosing $k = \sqrt{n}$ results in the optimal SRPS cube.
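
The update bound can be checked by enumerating the cells whose aggregation region contains the updated cell. The Python sketch below does this by brute force; the function names and the parameters ($n = 16$, $d = 2$) are assumptions chosen for illustration, and the bound is only checked for $k \ge 2$.

```python
from itertools import product


def affected_coords(u, n, k):
    """Coordinates c (in one dimension) whose aggregation interval contains u."""
    result = []
    for c in range(u, n):
        anchor = (c // k) * k
        start = 0 if c == anchor else anchor + 1
        if start <= u:
            result.append(c)
    return result


def cells_to_update(u, n, k):
    """All cells of the SRPS cube that change when cell u is updated.
    Aggregation regions are hyper-rectangles, so the affected cells are the
    cross product of the affected coordinates per dimension."""
    return list(product(*(affected_coords(x, n, k) for x in u)))


def worst_case_bound(n, k, d):
    """The bound (n/k + k - 2)^d stated in the text."""
    return (n // k + k - 2) ** d


n = 16
affected = cells_to_update((1, 1), n, 4)
best_k = min((2, 4, 8), key=lambda k: worst_case_bound(n, k, 2))
```

For $n = 16$ and $k = 4$ an update to cell [1, 1] touches $6^2 = 36$ cells, and among the tested box sizes the bound is smallest for $k = \sqrt{n} = 4$.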



2.3 DDC: The Dynamic Data Cube Technique 

In this section an overview of the Dynamic Data Cube technique [2] is given. As 
in PS, RPS, and SRPS the answer to an arbitrary range sum query is obtained 
by combining the results of the corresponding prefix queries. 

The basic DDC technique makes use of non-intersecting boxes which store 
pre-computed values that only summarize the cells in the box. Those values are 
stored in the “lower right” surfaces of the box (border cells) and summarize the 
cells in a region that has the anchor of the box as its anchor and the surface cell 
as the endpoint. The boxes are organized into a tree that recursively partitions 
the original data cube. The root node encompasses the entire data cube. It forms 
children by dividing its range in each dimension in half. Each of the children is
in turn subdivided into children, and so on.

The values in the border cells are cumulative. Thus an update to the anchor
cell of a box has to be propagated to all border cells in the box. To reduce
the update cost [2] introduces the $B^c$ tree. $B^c$ trees are standard B-trees whose
non-leaf nodes are augmented by an auxiliary value that stores the sum of the
leaves in the left sub-tree. By taking advantage of these auxiliary values, $B^c$ trees
provide balanced query and update costs of $\log m$ for any one-dimensional array
of size $m$ that stores cumulative values. By storing the (one-dimensional) border
cell arrays in $B^c$ trees, update and query costs of $O(\log^2 n)$ can be achieved
for two-dimensional data cubes. For data cubes with $d > 2$ dimensions the
$(d-1)$-dimensional surfaces that contain the border cells are recursively stored
as $(d-1)$-dimensional data cubes. Thus DDC with $B^c$ trees and the recursive
technique for storing the border values guarantees the polylogarithmic update
and query costs of $O(\log^d n)$.
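
The role of the auxiliary left-subtree sums can be illustrated with a simplified binary tree. The Python sketch below is not the tree of [2]: it replaces the B-tree by a static balanced binary tree over a fixed array, and the class name `LeftSumTree` is hypothetical. It only shows why one auxiliary value per internal node yields logarithmic prefix queries and updates.

```python
class LeftSumTree:
    """Static balanced binary tree over m >= 1 leaves. Each internal node,
    identified by its leaf range (lo, hi), stores the sum of the leaves in
    its left subtree."""

    def __init__(self, values):
        self.leaves = list(values)
        self.left_sum = {}
        self._build(0, len(self.leaves))

    def _build(self, lo, hi):
        if hi - lo <= 1:
            return
        mid = (lo + hi) // 2
        self.left_sum[(lo, hi)] = sum(self.leaves[lo:mid])
        self._build(lo, mid)
        self._build(mid, hi)

    def prefix_sum(self, index):
        """Cumulative value leaves[0] + ... + leaves[index]."""
        lo, hi, total = 0, len(self.leaves), 0
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if index >= mid:
                total += self.left_sum[(lo, hi)]  # skip the whole left subtree
                lo = mid
            else:
                hi = mid
        return total + self.leaves[lo]

    def add(self, index, delta):
        """Add delta to one leaf; only nodes on the root-to-leaf path change."""
        lo, hi = 0, len(self.leaves)
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if index < mid:
                self.left_sum[(lo, hi)] += delta
                hi = mid
            else:
                lo = mid
        self.leaves[lo] += delta


tree = LeftSumTree([3, 1, 4, 1, 5, 9, 2, 6])   # assumed sample values
before = tree.prefix_sum(4)
tree.add(2, 10)
after = tree.prefix_sum(4)
```

Note that the auxiliary sums are stored in addition to the leaf values, which mirrors the space overhead that the tree-based approach introduces.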



2.4 SDDC: The Space-Efficient Dynamic Data Cube Technique

Like the Dynamic Data Cube, SDDC balances query and update costs to
$O(\log^d n)$. Those bounds are maintained at much lower storage costs. For clarity,
we will first describe a simpler basic approach that has higher update costs, and 
then SDDC with the polylogarithmic costs. 



¹ The only exception occurs for $k = 1$, when SRPS collapses to the Prefix Sum
technique.

Fig. 4. Original data cube and corresponding SDDC cube. Basic SDDC cube with query
example: SUM(A[0,0] : A[7,8]) = (126+65+42)+(10+13) = 256



The Basic SDDC. To construct the SDDC cube, the data cube is first partitioned
into boxes using the same technique as SRPS, except for two differences.
First, the side-length of a box is set to $k = n/2$, i.e., the data cube is partitioned
into $2^d$ boxes of equal size. Second, while the aggregation regions of the border
cells remain the same, the inner cells do not store relative prefix sums any more.
Instead a recursive approach is taken. For each box the region that contains all
inner cells of that box (which is a hyper-cube of side-length $n/2 - 1$) is partitioned
into $2^d$ non-intersecting boxes of equal size. Their regions of inner cells
are then in turn partitioned into $2^d$ boxes, and so on. Conceptually the boxes
of the basic SDDC form a tree where each node corresponds to a box. The root
node encompasses the entire data cube. The children of a node are those smaller
boxes that partition the regions of the inner cells of the node. They store the
corresponding border cell values. At the leaf level nodes simply store the value
of the single cell that corresponds to the node. Since the side-length of a node is
less than half the side-length of its parent, the tree height cannot exceed $\log_2 n$.
Figure 4 presents a data cube and the corresponding basic SDDC.

Instead of dividing the region of the inner cells in each dimension in half, one 
could alternatively choose other partitionings into boxes. Partitions with flexi- 
ble split positions in the different dimensions can be used to identify similarity 
regions, which could be exploited by operations that can take advantage of low 
variance distributions (e.g., for data compression).

Due to how the boxes are created, the complete basic SDDC fits into the
space of the original data cube (with $n^d$ cells). The storage savings compared to
the basic DDC approach are considerable. In [8] we show that the basic DDC 
requires more than twice the space of the original data cube. Thus our new basic 
SDDC technique reduces the space overhead by more than the size of the original 
data cube! 

To find the sum for any prefix region, the tree is descended and the appropriate
border values are added. On a tree level the query is answered as
described for SRPS. The only difference is that instead of accessing an inner
cell, the query recursively accesses the corresponding child node (see [8] for details).
In the example in Fig. 4 the cells that are accessed in order to compute
SUM(A[0,0] : A[7,8]) are shaded; cell [7,8] is hatched. Since the partitioning of
the data cube is non-overlapping, at most one box per level contains the endpoint
of the query region. For each such box at most $2^d - 1$ border cells have to
be accessed (analysis identical to SRPS). Since the tree has at most $\log_2 n$ levels,
the cost for any prefix query is at most $(2^d - 1) \log_2 n = O(\log n)$.

Updates on the basic SDDC are very expensive in the worst case. Consider 
an update to cell $[1, 1, \ldots, 1]$. This update already affects $O(n^{d-1})$ border cells
in the root node (see [8] for details). 



SDDC with Improved Updates. The problem the basic SDDC faces regarding
updates is similar to the update problem for the basic DDC technique. To
reduce the update costs, we can apply the same technique as for DDC, i.e., using
$B^c$ trees for balanced update and query costs on two-dimensional data cubes,
and storing the border values of higher-dimensional data cubes recursively (see
Sect. 2.3). However, $B^c$ trees and the recursive approach introduce unnecessary
redundancy. We follow a similar approach, but remove the additional storage
requirements.

Recall that the values of the border cells in the same surface are cumulative, 
which results in the high worst case update costs. To reduce the costs for one- 
dimensional arrays of border cells we use an elegant technique that embeds a 
tree into the array. The main idea is to first replace the cumulative values by the 
corresponding differences of the values of neighboring cells and then to apply 
the basic SDDC technique to this array of differences. Queries and updates are 
processed as described for the basic SDDC technique, resulting in a worst case
cost of $\log n$ for both operations. Thus the $B^c$ tree is replaced by a data structure
which does not add any space overhead compared to storing the original array.
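
The following Python sketch shows one possible reading of this embedding for a one-dimensional array. It is a reconstruction under stated assumptions, not the implementation of [8]: the range is split in half recursively, the first cell of the second half stores the sum from the start of the current range, and the remaining cells are handled recursively. All function names and sample values are hypothetical.

```python
def build_embedded(values):
    """Embed a tree into an array of the same length (no extra cells).
    For a range [lo, hi) with midpoint mid, cell mid stores sum(values[lo..mid]);
    cell lo keeps its own value; [lo+1, mid) and [mid+1, hi) are built recursively."""
    s = list(values)

    def rec(lo, hi):
        if hi - lo < 2:
            return
        mid = lo + (hi - lo) // 2
        s[mid] = sum(values[lo:mid + 1])
        rec(lo + 1, mid)
        rec(mid + 1, hi)

    rec(0, len(values))
    return s


def prefix(s, q):
    """Return values[0] + ... + values[q], reading one cell per tree level."""
    assert 0 <= q < len(s)
    lo, hi, total = 0, len(s), 0
    while True:
        mid = lo + (hi - lo) // 2
        if hi - lo >= 2 and q >= mid:
            total += s[mid]
            if q == mid:
                return total
            lo = mid + 1
        else:
            total += s[lo]
            if q == lo:
                return total
            lo, hi = lo + 1, mid


def update(s, u, delta):
    """Add delta to values[u], changing at most two cells per tree level."""
    assert 0 <= u < len(s)
    lo, hi = 0, len(s)
    while True:
        if hi - lo == 1:
            s[lo] += delta
            return
        mid = lo + (hi - lo) // 2
        if u > mid:
            lo = mid + 1
            continue
        s[mid] += delta  # cell mid aggregates lo..mid, which contains u
        if u == mid:
            return
        if u == lo:
            s[lo] += delta
            return
        lo, hi = lo + 1, mid


# Assumed cumulative border values and their neighbor differences.
cumulative = [3, 4, 8, 9, 14, 23, 25, 31]
diffs = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
tree = build_embedded(diffs)
value_at_3 = prefix(tree, 3)   # recovers cumulative[3]
update(tree, 2, 10)            # raises cumulative[2], cumulative[3], ... by 10
```

A cumulative value is recovered as a prefix sum over the differences, and a change that would cascade through the cumulative array becomes a single update of one difference.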

The DDC technique stores $(d-1)$-dimensional surfaces of border cells recursively
as $(d-1)$-dimensional data cubes. Since the surfaces are overlapping,
redundancy is introduced. SDDC removes this redundancy by ensuring that val- 
ues in the overlapping regions are stored only once. The idea is to embed the 
recursively computed values into the space of exactly those values they replace. 

Note that improving on the $B^c$ trees and the recursive technique for storing
the values of the border cells further increases the space savings of SDDC
compared to DDC. Due to the improvements, SDDC has the same storage consumption
as the original data cube. Its query and update costs are $O(\log^d n)$,
which is sublinear in the side-length of the data cube. A more detailed description
of the technique can be found in [8].



3 Conclusion 

Aggregate range queries are useful tools for analyzing information that is stored 
in data cubes. For massive data sets, however, accessing and aggregating the 
relevant data on-the-fly can result in slow responses that negatively affect the
analysis process. In this paper two new techniques were discussed that speed
up range queries by storing pre-aggregated information, while still supporting
efficient updates. Using SRPS or SDDC cubes instead of the original data cube 
provides efficient queries and updates without introducing any space overhead. 

To be more precise, we developed one technique (SRPS) that guarantees 
that any aggregate range query is answered in constant time and that no update 
results in costs higher than the square root of the data cube size. We presented 
another technique (SDDC) that improves on the only existing technique which
provides provably polylogarithmic worst case query and update costs for any 
data cube. Our technique guarantees the same query and update costs, while 
reducing the space overhead by an amount of space that is greater than the size 
of the original data cube. 

Thus our new techniques efficiently support online aggregation for massive 
data sets. Reducing the space requirements not only saves storage costs, but at 
the same time reduces the real access and update times. This is not reflected in 
our cost formulas that are only based on cell accesses. Real access costs, how- 
ever, also depend on cache sizes and I/O times for external storage devices. For 
real query and update costs, smaller space consumption can be very beneficial. 
Similar to all methods that are based on prefix sums, our approaches are par- 
ticularly suited for dense data sets. We are currently developing techniques for 
sparse high-dimensional data cubes [7]. Also, to compare our techniques to previ- 
ous approaches in more detail, further theoretical and experimental evaluations 
are needed. 



References 

[1] C.-Y. Chan and Y. E. Ioannidis. Hierarchical cubes for range-sum queries. In Proc.
Int. Conf. on Very Large Databases (VLDB), pages 675-686, 1999. 

[2] S. Geffner, D. Agrawal, and A. El Abbadi. The dynamic data cube. In Proc. Int. 
Conf. on Extending Database Technology (EDBT), pages 237-253, 2000. 

[3] S. Geffner, D. Agrawal, A. El Abbadi, and T. Smith. Relative prefix sums: An 
efficient approach for querying dynamic OLAP data cubes. In Proc. Int. Conf. on 
Data Engineering (ICDE), pages 328-335, 1999. 

[4] S. Geffner, M. Riedewald, D. Agrawal, and A. El Abbadi. Data cubes in dynamic 
environments. Data Engineering Bulletin, 22(4):31-40, 1999. 

[5] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pel- 
low, and H. Pirahesh. Data cube: A relational aggregation operator generalizing 
group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, pages 
29-53, 1997. 

[6] C. Ho, R. Agrawal, N. Megiddo, and R. Srikant. Range queries in OLAP data 
cubes. In Proc. Int. Conf. on Management of Data (SIGMOD), pages 73-88, 1997. 

[7] M. Riedewald, D. Agrawal, and A. El Abbadi. pCube: Update-efficient online 
aggregation with progressive feedback and error bounds. In Proc. Int. Conf. on 
Scientific and Statistical Database Management (SSDBM), 2000. To appear. 

[8] M. Riedewald, D. Agrawal, A. El Abbadi, and R. Pajarola. Space-efficient data 
cubes for dynamic environments. Technical Report TRCS00-05, UC Santa Barbara,
2000.

[9] Transaction Processing Performance Council. TPC-H benchmark (1.1.0). Available 
at http://www.tpc.org. 




On Making Data Warehouses Active 



Michael Schrefl¹ and Thomas Thalhammer²

¹ University of South Australia, Advanced Computing Research Center
schrefl@cs.unisa.edu.au

² University of Linz, Data and Knowledge Engineering
thalhammer@dke.uni-linz.ac.at



Abstract. Data warehouses, which are the core elements of On-Line An- 
alytical Processing (OLAP) systems, are passive since all tasks related 
to analyzing and making decisions must be carried out manually. This 
paper introduces a novel architecture, the active data warehouse, which 
applies the idea of event-condition-action rules (ECA rules) from active 
database systems to automize repetitive analysis and decision tasks in 
data warehouses. The work of an analyst is mimicked by analysis rules, 
which extend the capabilities of conventional ECA rules in order to sup- 
port multidimensional analysis and decision making. 



1 Introduction 

Database systems are the core elements of information systems that support On- 
Line Transaction Processing (OLTP) in day-to-day business. Originally, every 
action in a (passive) database system had to be initiated by explicit user interac- 
tion or application request. Event-condition-action (ECA) rules were introduced 
to extend (passive) database systems with reactive capabilities. An ECA rule 
relies on the basic idea that when a specified happening occurs (the event) and a
certain boolean expression evaluates to “TRUE” (the condition) then an appropriate
action will be carried out automatically without explicit user interaction
or application request. A database system that supports ECA rules is referred
to as an active database system.

Data warehouses, which are the core elements of On-Line Analytical Processing
(OLAP) systems [6], are passive. Although most tasks related to extracting,
transforming, and loading data from various operational OLTP systems can
be automized [5], all tasks related to analyzing data and making decisions must
be carried out manually. In decision making, analysts carry out different kinds
of decision tasks. In the case of routine decision tasks, analysts apply well-established
decision criteria to routine problems through standardized analyses that
are carried out on a day-to-day basis. In the case of non-routine decision tasks,
analyses are carried out on the basis of “trial-and-error”, since decision criteria
are vague and cannot be formalized. Finally, semi-routine decision tasks contain
elements of routine decision tasks and non-routine decision tasks. We employ
techniques from active database systems to automize routine decision tasks and
the routinizable elements of semi-routine decision tasks in data warehouses. The
resulting architecture is called active data warehouse (cf. Fig. 1).

Y. Kambayashi, M. Mohania, and A M. Tjoa (Eds.): DaWaK 2000, LNCS 1874, pp. 34-46, 2000.

Fig. 1. Architecture of an Active Data Warehouse



In analyzing data for a routine decision task (e.g., change the price of an 
article), analysts follow a “top down” approach by initially inspecting a multi- 
dimensional cube on very coarse dimension levels (e.g., inspect the annual sales 
figures of the article). The outcome of analyzing the cube is (1) the decision to 
perform the task, or (2) the decision not to perform the task, or (3) the decision 
to continue analysis on finer levels of granularity (e.g., sales figures are inspected 
on a monthly basis). To enable automatic decision making in an active data ware- 
house, we introduce a data structure - the analysis graph - which represents all 
cubes that are required to analyse a certain business problem. For each cube 
there are three conditions that represent the choices in making decisions. To 
automize analysis and decision making for routine decision tasks, we introduce
analysis rules as an extension to ECA rules from active database systems. We
introduce an event model and an action model to extend data warehouses with 
reactive capabilities. The event of an analysis rule is a timepoint, at which the 
rule has to be fired. An analysis rule’s action is the decision to execute or not 
to execute a transaction in an OLTP system. Analysis rules realize the process 
of decision making by evaluating the conditions associated with a cube of the 
analysis graph in a “top down” fashion. 

The remainder of this paper is organized as follows. In Section 2, we present 
an event model and an action model for active data warehouses. Section 3 in- 
troduces the analysis graph. Section 4 introduces analysis rules and a model for 
decision making. Finally, Section 5 summarizes our achievements. 



2 Events and Actions 

In this paper, we view data in an active data warehouse from an object-oriented 
perspective, which is independent from a certain implementation (e.g., ROLAP, 
MOLAP). A basic assumption is that the structure of an object type (i.e., its 
attributes, relationships) in an OLTP database will be represented by a dimen- 
sion level and a level description of one of the data warehouse’s dimensions (as 
described in [7]). As we will see, the object-oriented view is necessary to extend 
the architecture of a conventional data warehouse with an action model and an
event model. In the remainder of this section, we describe the event model and
the action model from an object-oriented view. 

2.1 Event Model 

An active data warehouse uses events to determine the timepoints at which
analysis and decision tasks are initiated automatically. These timepoints can be
either known in advance (e.g., “at the end of every quarter ...”) or determined as a
reaction to recent happenings in the data warehouse’s source systems (e.g., “twenty
days after the price of an article was changed ...”). We distinguish between event
classes and event types. Event classes collect instances of event types. Event 
types define the structure of their instances, providing at least an occurrence 
time and a unique event identifier for each instance. An instance of an event 
type is called an event. Due to the typical temporal nature of analysis and decision
tasks, active data warehouses use an event model that is more primitive than 
the event model of active database systems [10,3,4]. Our event model provides 
three basic event types: (1) OLTP method events, (2) calendar events and (3) 
relative temporal events. 

An OLTP method event represents the invocation of a method in an OLTP 
database. We extend the data warehouse’s import mechanism to make OLTP 
method events available in the data warehouse. OLTP method events serve as a 
“reference point” to relative temporal events, which are used to initiate analyses. 

Calendar events represent fixed points in time, at which analysis and decision 
processes will be initiated. Calendar classes may be populated either manually 
or automatically with the help of a calendar tool. 

A relative temporal event r is an event, whose occurrence depends on the 
occurrence of another event e, which serves as the reference point of r. The 
dependence on e is twofold: First, the existence of r depends on the existence of 
e (i.e., r occurs only if e occurred). Second, r depends temporally on e (i.e., r 
occurs a specified period after e occurred) . Relative temporal events are defined 
and populated within the data warehouse. 
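
The three event types can be sketched as simple record types. The Python code below is only an illustration of the model described above; all class, field, and function names are assumptions and do not stem from an actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Every event carries a unique identifier and an occurrence time."""
    event_id: str
    occurrence_time: datetime


@dataclass(frozen=True)
class OltpMethodEvent(Event):
    """Invocation of a method on an object in an OLTP database."""
    object_id: str
    method: str


@dataclass(frozen=True)
class CalendarEvent(Event):
    """Fixed point in time, e.g. the end of a quarter."""


@dataclass(frozen=True)
class RelativeTemporalEvent(Event):
    """Event that occurs a specified period after its reference event."""
    reference_id: str


def derive_relative_event(reference: Optional[Event], delay: timedelta,
                          event_id: str) -> Optional[RelativeTemporalEvent]:
    """Create r only if its reference event e exists (existence dependence);
    r occurs `delay` after e (temporal dependence)."""
    if reference is None:
        return None
    return RelativeTemporalEvent(event_id, reference.occurrence_time + delay,
                                 reference.event_id)


# Hypothetical example: "twenty days after the price of an article was changed".
price_change = OltpMethodEvent("e1", datetime(2000, 3, 1), "article-42", "changePrice")
follow_up = derive_relative_event(price_change, timedelta(days=20), "r1")
```

The relative temporal event keeps a reference to the OLTP method event, which reflects that it cannot exist without its reference point.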

2.2 Action Model 

The purpose of an analysis rule is to make “decisions” for objects that are 
available in OLTP systems and in the data warehouse. Since we follow an object- 
oriented approach, a “decision” means to invoke or not to invoke a method on 
a certain object in an OLTP system. If we refer to the term “method” in the 
following, we mean methods of an OLTP object type that represent transactions. 
These methods represent the “decision space” of an active data warehouse. To 
make the behavior of an OLTP object type available to analysis rules, the action 
model of an active data warehouse includes (1) the specifications of the OLTP 
object type’s methods together with required parameters, (2) the preconditions 
that have to be satisfied before the method can be invoked on an instance, 
and (3) a conflict resolution mechanism, which solves contradictory decisions of 
different analysis rules. Since an object may be subject to decisions by different
analysis rules, decision conflicts may occur. To detect such conflicts, we use a
“conflict table” covering the methods of a certain OLTP object type. A conflict
table consists of tuples $(m_1, m_2, m_3)$, where $m_1$ and $m_2$ identify two conflicting
methods and $m_3$ specifies the conflict resolution method that will be finally
executed in OLTP systems. We refer to such conflicts as inter-rule conflicts. If
a conflict cannot be solved automatically it has to be reported to analysts for
further manual treatment.

Example 1. An active data warehouse may invoke methods 
withdrawFromMarket(location: COUNTRY) and changePrice(newPrice: 
DOLLAR) for instances of object type Article in OLTP systems. These two 
methods cause a conflict if the price of an article should be changed and if 
the article should be withdrawn from the market at the same time. Thus, 
the conflict table contains the entry (withdrawFromMarket, changePrice, 
withdrawFromMarket), which specifies method withdrawFromMarket as the 
conflict resolution method. 
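
The conflict table of Example 1 can be sketched as a small lookup structure. This is only an illustration: the dictionary layout, the name `resolve_conflict`, and the choice to treat a pair of methods as unordered are assumptions that the text does not prescribe.

```python
from typing import Optional

# Conflict table for object type Article, keyed by the unordered pair
# (m1, m2) and mapping to the resolution method m3. The single entry is
# the one given in Example 1; the layout itself is an assumption.
CONFLICT_TABLE = {
    frozenset({"withdrawFromMarket", "changePrice"}): "withdrawFromMarket",
}


def resolve_conflict(m1: str, m2: str) -> Optional[str]:
    """Return the resolution method m3 if (m1, m2) is a listed conflict.

    Returns None when the pair is not listed, i.e. when the table records
    no conflict between the two methods.
    """
    return CONFLICT_TABLE.get(frozenset({m1, m2}))


winner = resolve_conflict("changePrice", "withdrawFromMarket")
```

In this sketch a lookup that returns `None` means that both decisions can be exported unchanged; how unresolvable conflicts are reported to analysts is left out.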

Decision making in an active data warehouse will usually be carried out in 
an asynchronous manner, i.e., without having access to OLTP systems. Decisions 
may then be out of date, since OLTP systems may have been changed manually 
during the period between the refresh of the data warehouse and the export of 
decisions to OLTP systems. Such problems are reported to analysts for further 
manual treatment together with non-routinizable analysis cases, for which an 
active data warehouse could not come to a definite decision. Out-of-dateness 
of decisions can be recognized by (1) checking whether the object for which the 
method will be executed is still available in the OLTP system and (2) checking 
whether the method’s precondition still holds for the object in the OLTP system. 
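
The two checks for out-of-dateness can be written down directly. The following is a minimal sketch under stated assumptions: OLTP objects are modelled as a plain dictionary, and `is_out_of_date`, `on_market`, and the attribute `onMarket` are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict


def is_out_of_date(oltp_objects: Dict[str, dict],
                   object_id: str,
                   precondition: Callable[[dict], bool]) -> bool:
    """A decision is out of date if (1) the object no longer exists in the
    OLTP system or (2) the method's precondition no longer holds for it."""
    obj = oltp_objects.get(object_id)
    return obj is None or not precondition(obj)


# Hypothetical OLTP state and precondition for withdrawFromMarket.
articles = {"a1": {"onMarket": True}}


def on_market(article: dict) -> bool:
    return article["onMarket"]


stale_a1 = is_out_of_date(articles, "a1", on_market)   # object present, precondition holds
stale_a2 = is_out_of_date(articles, "a2", on_market)   # object missing
```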

3 The Analysis Graph 

The analysis graph represents all multidimensional cubes, which are necessary for 
automatic analysis and decision making. An analysis graph is specified in a “top 
down” fashion once by an analyst and is then used by an analysis rule. By “top 
down” , we mean that cubes will be “derived” from existing more general cubes 
using OLAP operators DRILLDOWN and SLICE on existing cubes by specifying 
only the new dimension level for DRILLDOWN or an additional condition in the 
case of SLICE. This approach of specifying cubes is referred to as incremental 
specification. Although incremental specification of cubes is natural to the way 
an analyst thinks, each cube needs to be created internally by applying OLAP 
operator ROLLUP on the base cube since a more general cube does not provide 
cells at levels of detail needed by DRILLDOWN. Before defining the analysis graph, 
we (1) introduce a new definition of cube that supports incremental specification 
of cubes and (2) describe how analysts specify cubes incrementally using OLAP 
operators DRILLDOWN and SLICE. 

Our data model follows a cube-oriented approach [12], which generally 
distinguishes between the notions of dimension, dimension level, level description, 
base cube (or fact table), and cell (or fact). Together with OLAP operators SLICE, 
DRILLDOWN and ROLLUP these concepts have been well discussed in recent years [9, 
2, 1, 11] and are thus considered to be understood. 

Fig. 2. Retail Data Warehouse Scheme 

Example 2. Figure 2 represents a typical data warehouse scheme in retail trade. 
Base cube Sales provides information about dollar amounts of sales and 
the sold quantity according to dimensions Product, Location and Time. 
The levels of dimension Time constitute the two hierarchies Date → Month 
→ Quarter → Year and Date → Week → Season, which are the paths for 
ROLLUP and DRILLDOWN. Similar hierarchies exist for dimensions Product and 
Location. Each dimension level is described by a level description, e.g., 
dimension level Location[Store] is described by level description Store(storeId, 
city, owner). This example serves as a running example in our paper. 

Although there exists a basic agreement on fundamental multidimensional 
concepts, the various definitions of cube [8,1,11] are not sufficient to specify 
cubes incrementally. These approaches lack one or several of the components that a 
multidimensional cube c must consist of such that cubes can be specified “top down” 
and created “bottom up” from the base cube. These components are: 

1. a cube scheme $S = (A_1 : D_1[l_1], \ldots, A_n : D_n[l_n], M_1 : dom(M_1), \ldots, M_k : dom(M_k))$, 
which defines the dimension attributes $A_i$ and the measure attributes $M_j$ of $c$; 
the domain of $A_i$ consists of the values which belong to a level $l_i$ of a certain 
dimension $D_i$, denoted as $D_i[l_i]$, 

2. a base cube $b(B)$^1, from which cells are taken for aggregation, 

3. a set of conditions $C$, which determine a subset of cells of $b(B)$ (denoted as 
$b(B)_C$) for aggregation, 

4. a list of aggregation functions $AGG$, which roll up the subset of cells $b(B)_C$, 
where each $f_i \in AGG$ will be evaluated for each subset of cells of $b(B)_C$ 
whose members roll up to the same dimension value for every dimension 
attribute of $S$. 

^1 $B$ is a base cube scheme, whose dimension attributes $A_i$ represent the bottom levels 
of the respective dimensions. 

Example 3. Suppose that cube SalesArticlesCountriesQuarters 
consists of cube scheme (P: Product[Article], L: Location[Country], T: 
Time[Quarter], SalesTotal: DOLLAR). Attributes P, L, and T are dimension 
attributes; attribute SalesTotal is a measure attribute. Base cube Sales 
provides the cells for rollup using function SUM(Sales). 

To specify cubes incrementally, we use the three OLAP operators ROLLUP, 
DRILLDOWN, and SLICE. ROLLUP is needed to create the “root” cube from the 
base cube. By applying DRILLDOWN or SLICE to the “root” cube or to a cube 
which was derived from the root cube (denoted as $c_{old}$), a new cube $c_{new}$^2 will 
be created. 

If applied to $c_{old}$ with scheme $S_{old} = (A_1 : D_1[l_1], \ldots, A_n : D_n[l_n], M_1 : 
dom(M_1), \ldots, M_k : dom(M_k))$, OLAP operator DRILLDOWN $A_i$ TO $l_{new}$ 
defines $c_{new}$ as follows: (1) $S_{new} = (A_1 : D_1[l_1], \ldots, A_{i-1} : D_{i-1}[l_{i-1}], A_i : 
D_i[l_{new}], A_{i+1} : D_{i+1}[l_{i+1}], \ldots, A_n : D_n[l_n], M_1 : dom(M_1), \ldots, M_k : dom(M_k))$, 
(2) $b(B)_{new} = b(B)_{old}$, (3) $AGG_{new} = AGG_{old}$, and (4) $C_{new} = C_{old}$. 

If applied to $c_{old}$, OLAP operator SLICE <cond> defines $c_{new}$ as follows: (1) 
$S_{new} = S_{old}$, (2) $b(B)_{new} = b(B)_{old}$, (3) $AGG_{new} = AGG_{old}$, and (4) $C_{new} = 
C_{old} \cup \{$<cond>$\}$. 

We provide a simple textual language to specify cubes incrementally. 

    (a) DEFINE SalesArticlesCountriesQuarters AS 
            (Product P, Location L, Time T, SUM(Sales) SalesTotal) 
            ROLLUP P TO Article, T TO Quarter, L TO Country 
            FROM Sales; 

    (b) DEFINE SalesAustraliaQuarters AS 
            SLICE L.Name = 'Australia' 
            FROM SalesArticlesCountriesQuarters 

Fig. 3. Incremental specification of cubes 

Example 4. Figure 3 (a) shows the specification of cube 
SalesArticlesCountriesQuarters, which was described in the previous 
example. Hypercube SalesAustraliaQuarters (cf. Fig. 3 (b)) is derived from 
SalesArticlesCountriesQuarters by applying OLAP operator SLICE, which 
selects all cells that comply with condition L.Name = 'Australia'. 
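
The operator definitions above say that DRILLDOWN changes only the level of one dimension attribute and SLICE only adds a condition, while base cube and aggregation functions are inherited. A minimal sketch of this bookkeeping follows. The class `Cube` and the functions `drilldown` and `slice_` are hypothetical names; the sketch records specifications only, it does not compute cells and does not check that the new level is actually finer than the old one.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Cube:
    levels: Tuple[Tuple[str, str, str], ...]   # (attribute, dimension, level)
    measures: Tuple[Tuple[str, str], ...]      # (measure name, aggregation)
    base: str                                  # name of the base cube
    conditions: FrozenSet[str] = frozenset()   # set C of slice conditions


def drilldown(old: Cube, attr: str, new_level: str) -> Cube:
    """DRILLDOWN attr TO new_level: only one level of the scheme changes."""
    if attr not in {a for a, _, _ in old.levels}:
        raise KeyError(attr)
    levels = tuple((a, d, new_level if a == attr else lvl)
                   for a, d, lvl in old.levels)
    return replace(old, levels=levels)


def slice_(old: Cube, cond: str) -> Cube:
    """SLICE cond: scheme, base cube and aggregations stay, C grows by cond."""
    return replace(old, conditions=old.conditions | {cond})


root = Cube(
    levels=(("P", "Product", "Article"),
            ("L", "Location", "Country"),
            ("T", "Time", "Quarter")),
    measures=(("SalesTotal", "SUM(Sales)"),),
    base="Sales",
)
australia = slice_(root, "L.Name = 'Australia'")
australia_months = drilldown(australia, "T", "Month")
```

Because every derived cube keeps the base cube name and the accumulated conditions, it can still be created “bottom up” by a ROLLUP on the base cube, which is the point of the four-component cube definition.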

Before we define the analysis graph, we have to define the more specific 
relationship between two cube schemes $R$ and $S$. $R$ is said to be more specific 
than $S$, denoted as $R < S$, if (1) each dimension attribute $A_i$ of $R$ represents at 
least the same dimension level as the equivalent dimension attribute $A_i$ of $S$ 
and (2) there are one or more dimension attributes of $R$ which represent a more 
detailed dimension level than the equivalent dimension attributes of $S$. If $R$ is 
more specific than $S$, then $S$ is more general than $R$ (denoted as $S > R$). A cube 
$c_{new}$ is said to be more specific than another cube $c_{old}$, denoted as $c_{new} < c_{old}$, 
iff (1) $S_{new} < S_{old} \wedge C_{new} \supseteq C_{old}$ or (2) $S_{new} = S_{old}$ and $C_{new} \supset C_{old}$. 

^2 We distinguish the components of $c_{old}$ and $c_{new}$ by using the subscript of the 
respective cube, e.g., $b(B)_{old}$ is the base cube of $c_{old}$. 
The analysis graph is a directed acyclic graph $AG = (V, E)$. $V$ is a set of 
cubes, $E \subseteq V \times V$ is a set of edges, representing direct “more specific than” 
relationships between cubes: $E = \{(c_1, c_2) \in V \times V \mid c_1 < c_2 \wedge \neg\exists c_3 \in V : c_1 < 
c_3 \wedge c_3 < c_2\}$. An analysis graph must have a single “root” $c_r$, i.e., $\forall c \in (V \setminus \{c_r\}) : 
c < c_r$. The predicate super(c) denotes the set of direct more general cubes of 
some cube $c$, i.e., $\{c_j \in V \mid (c, c_j) \in E\}$. 
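
The edge set E can be computed from the “more specific than” relation by dropping every pair that has an intermediate cube. The sketch below is illustrative only. Cubes are reduced to a pair (level depths, condition set); the numeric depths (Country = 0, City = 1, Quarter = 0, Month = 1) and the conditions attached to each cube are assumptions inferred from the cube names, not data taken from the paper.

```python
from itertools import permutations


def scheme_more_specific(r, s):
    """R < S: no level of R is coarser and at least one is more detailed."""
    return all(r[a] >= s[a] for a in s) and any(r[a] > s[a] for a in s)


def more_specific(new, old):
    """c_new < c_old as defined for cubes (scheme S and condition set C)."""
    (s_new, c_new), (s_old, c_old) = new, old
    return ((scheme_more_specific(s_new, s_old) and c_new >= c_old)
            or (s_new == s_old and c_new > c_old))


def direct_edges(cubes):
    """E: pairs (c1, c2) with c1 < c2 and no c3 with c1 < c3 < c2."""
    names = list(cubes)
    lt = {(a, b) for a, b in permutations(names, 2)
          if more_specific(cubes[a], cubes[b])}
    return {(a, b) for (a, b) in lt
            if not any((a, c) in lt and (c, b) in lt for c in names)}


AU, SY = "L.Name = 'Australia'", "L.Name = 'Sydney'"
cubes = {
    "SalesArticlesCountriesQuarters": ({"L": 0, "T": 0}, frozenset()),
    "SalesAustraliaQuarters": ({"L": 0, "T": 0}, frozenset({AU})),
    "SalesAustraliaMonths": ({"L": 0, "T": 1}, frozenset({AU})),
    "SalesAusCitiesQuarters": ({"L": 1, "T": 0}, frozenset({AU})),
    "SalesSydneyQuarters": ({"L": 1, "T": 0}, frozenset({AU, SY})),
    "SalesSydneyMonths": ({"L": 1, "T": 1}, frozenset({AU, SY})),
}
edges = direct_edges(cubes)
super_of_sydney_months = {b for (a, b) in edges if a == "SalesSydneyMonths"}
```

Under these assumptions SalesSydneyMonths has two direct more general cubes, SalesSydneyQuarters and SalesAustraliaMonths, although an analyst would derive it from only one of them. The pairwise check is cubic in the number of cubes, which is acceptable for the small graphs an analyst specifies by hand.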

OLAP operator DRILLDOWN creates a more specific cube that represents mea- 
sures in more detail and OLAP operator SLICE creates a more specific cube that 
represents a subset of cells. Although analysts specify cubes incrementally by 
using these OLAP operators, the complete set of direct “more specific” edges of 
the analysis graph may be different. 

Example 5. Figure 4 depicts an analysis graph, which consists of 
cubes SalesArticlesCountriesQuarters, SalesAustraliaQuarters, 
SalesAustraliaMonths, SalesAusCitiesQuarters, SalesSydneyQuarters, 
and SalesSydneyMonths. Hypercube SalesArticlesCountriesQuarters 
represents the “root” of the analysis graph. Dashed arrows depict how these 
cubes were specified incrementally by an analyst. Bold arrows depict all direct 
“more specific” edges between these cubes. 

4 Analysis Rules 

Analysis rules are templates for automatic analyses and decision making. The 
basic idea behind analysis rules is to extend conventional ECA rules from active 
database systems with multidimensional features (cubes, OLAP operators) of 
data warehouses and an approach for decision making. 

4.1 Specifying Analysis Rules 

The specification of an analysis rule consists of (1) the primary dimension level, 
which determines the OLTP object type whose instances are subject to decision 
making, (2) the rule’s condition, which selects the objects of the primary dimen- 
sion level for further rule processing, (3) the event class, which triggers the rule 
if a new event of this class occurs, (4) the analysis graph, whose nodes constitute 
the cubes for analysis, (5) a decision step for each cube of the analysis graph 
representing the rule’s conditions in decision making, and (6) the action, which 
represents the rule’s operational task. 
DEFINE ANALYSIS RULE HighPricedSoftDrinksQtrAusAR 
FOR P : Product[Article] 
WHERE (P.category = 'SoftDrinks') AND (P.pricePerUnit > 10) 
ON EndOfQuarter eoq 
USING CUBES 
    SalesAustraliaThisQuarter AS (Product P, Location L, Time T, SUM(Sales) SalesTotal) 
        SLICE L.Name = 'Australia' AND T.qtrId = eoq.quarterId 
        FROM SalesArticlesCountriesQuarters; 
    SalesAustraliaThisQuarterMonths AS DRILLDOWN T TO Month 
        FROM SalesAustraliaThisQuarter; 
    SalesAusCitiesThisQuarter AS DRILLDOWN L TO City 
        FROM SalesAustraliaThisQuarter; 
    SalesSydneyThisQuarter AS SLICE L.name = 'Sydney' 
        FROM SalesAusCitiesThisQuarter; 
    SalesSydneyThisQuarterMonths AS DRILLDOWN T TO Month 
        FROM SalesSydneyThisQuarter; 
ANALYZE SalesAustraliaThisQuarter 
    TRIGGER ACTION IF SalesTotal < 100,000 
    DETAIL ANALYSIS IF SalesTotal < 500,000; 
ANALYZE SalesAustraliaThisQuarterMonths 
    TRIGGER ACTION IF TREND(s1.SalesTotal, s2.SalesTotal, s3.SalesTotal) = 'DOWN' 
        FOR CELLS s1, s2, s3 
        WHERE s1.T = eoq.1stMon AND s2.T = eoq.2ndMon AND s3.T = eoq.3rdMon; 
ANALYZE SalesAusCitiesThisQuarter 
    TRIGGER ACTION IF AVG(SalesTotal) < 15,000 GROUPED BY T 
    DETAIL ANALYSIS IF AVG(SalesTotal) < 30,000 GROUPED BY T; 
ANALYZE SalesSydneyThisQuarterMonths 
    TRIGGER ACTION IF ((s2.SalesTotal - s1.SalesTotal)/s2.SalesTotal)*100 = -10 
        FOR CELLS s1, s2 
        WHERE s1.T = eoq.1stMon AND s2.T = eoq.2ndMon; 
TO EXECUTE P.withdrawFromMarket('Australia') 
END HighPricedSoftDrinksQtrAusAR 

Fig. 5. Analysis Rule HighPricedSoftDrinksQtrAusAR 

As described in Sect. 2.2, the decision of an analysis rule is to invoke or not 
to invoke a method on the objects of an OLTP object type in an OLTP system. 
Such objects are represented as values of one of the data warehouse’s 
dimensions. This dimension is referred to as the primary dimension. The dimension 
level whose values describe the instances of this object type is called the primary 
dimension level. If we compare analysis rules with conventional ECA rules in active 
object-oriented database systems, then the primary dimension level represents 
the object type of pseudo-variable “self”. 

An analysis rule’s action part refers to one of the primary dimension level’s 
OLTP methods that are defined for the OLTP object type and are included in 
the action model of the active data warehouse. 

Example 6. The primary dimension level of analysis rule 
HighPricedSoftDrinksQtrAusAR of Fig. 5 is specified with clause FOR 
P : Product[Article]. In its action part, this analysis rule refers to OLTP 
method withdrawFromMarket(...), which is defined for OLTP object type 
Article. 

Several analysis rules may share the same OLTP method as their action. An 
analysis rule’s condition is used to classify the entities of the primary dimension 
level to be analyzed by different analysis rules, which all refer to the same OLTP 
method as their action. The condition-part of an analysis rule is a simple boolean 
predicate that refers to describing attributes of the primary dimension level. 

Example 7. In our retail example, high priced “soft drinks” (price per unit 
> $10) are analyzed in a different way and at different time points than 
low priced “soft drinks” (price per unit < $10). For this purpose, two 
analysis rules HighPricedSoftDrinksQtrAusAR and LowPricedSoftDrinksMonAusAR 
are specified, which both refer to withdrawFromMarket(...) as their action. 

The event-part of an analysis rule refers to an event class of the active data 
warehouse’s event model, which may be a calendar class or a relative temporal 
event class (cf. Sect. 2.1). In the case of a calendar class, the analysis rule will 
be fired for all objects of the primary dimension level^3. In the case of a relative 
temporal event class, the analysis rule will be fired exclusively for the object of 
the OLTP event, which serves as reference point to the relative temporal event. 
The set of objects, for which an analysis rule is fired and which comply with the 
analysis rule’s condition is referred to as analysis scope. 

Example 8. Analysis rule HighPricedSoftDrinksQtrAusAR of Fig. 5 will be 
fired for all entities of the primary dimension level Product[Article] at the 
occurrence of an event of type EndOfQuarter. 

Since an analysis rule is specified for a certain primary dimension level, the 
cells of each cube of the analysis graph must describe the instances of the pri- 
mary dimension level. As a consequence, analysts cannot apply OLAP operators 
ROLLUP, DRILLDOWN, and SLICE on a cube’s primary dimension level. The re- 
maining non-primary dimension levels are referred to as analysis dimensions. 

In specifying the cubes of the analysis graph, “inner cubes” will be defined as 
local components of the analysis rule, whereas the analysis graph’s “root” cube 
is a globally available cube, which can be “imported” by the analysis rule. The 
reasons for this separation are the following: 

First, the definition of a cube may depend on the event of the analysis rule, 
e.g., the event may be used by a SLICE condition. Second, the cubes of an anal- 
ysis graph are closely related to the rule’s decision criteria, which reduces their 
reusability for analyses by other analysis rules. Third, changes in the specifica- 
tion of individual cubes will affect a single analysis rule only. There are many 
rules concerning the entities of a certain dimension level. Locality helps here 
to eliminate unwanted side-effects to other analysis rules. Fourth, some general 
cubes should be available to several analysis rules as “starting points” to reduce 
redundancies. 

^3 Notice that an efficient implementation will execute these rules in a set-oriented 
manner. 

Fig. 6. Identifying cells: (a) SalesAustraliaThisQuarter, (b) SalesAustraliaThisQuarterMonths 

Example 9. Local cube SalesAustraliaThisQuarter is specified by 
applying OLAP operator SLICE L.name = 'Australia' AND T.qtrId = 
eoq.quarterId on global cube SalesArticlesCountriesQuarters. 

After analyzing an instance of the primary dimension level using a certain 
cube of the analysis graph, there are three choices available: (a) execute the 
rule’s action for the instance of the primary dimension level, (b) don’t 
execute the rule’s action for the instance of the primary dimension level, and (c) 
continue analysis for the instance on a more specific cube. In automatic 
analysis and decision making, our approach comprises choices (a) and (c), which 
will be represented as boolean conditions referring to the cells of a cube. The 
complement of these two conditions is (b), which represents the decision not 
to execute the rule’s action in the OLTP system. In the following, we refer to 
condition (a) as triggerActionCond and condition (c) as detailAnalysisCond. 
The triplet (cube, triggerActionCond, detailAnalysisCond) is called decision 
step. Each cube of the analysis graph is analysed by a single decision step. 
Syntactically, a decision step is specified by ANALYZE cube TRIGGER ACTION IF 
triggerActionCond DETAIL ANALYSIS IF detailAnalysisCond. The predicates 
of the two conditions refer to the cells of cube. 

Each cube describes the entities of the primary dimension level by a 
single cell or by a set of cells. If an entity of the primary dimension level is 
described by a single cell, the decision step’s conditions (triggerActionCond and 
detailAnalysisCond) can refer to this cell directly. 

Example 10. Figure 6 (a) illustrates that each entity of the primary 
dimension level Product[Article] (depicted by a bold arc) of cube 
SalesAustraliaThisQuarter represents a single cell, since slice conditions 
Country.Name = 'Australia' and Quarter.qtrId = eoq.quarterId roll up 
to a single dimension value for a given article. Thus, the conditions of the first 
decision step in Fig. 5 refer directly to the cells of this cube. 

If an entity of the primary dimension level is described by a set of cells, then 
at least one analysis dimension level describes the primary dimension level with 
more than one dimension value. In this case, the decision step’s conditions refer 
to individual cells by identifying each single cell separately using cell identifiers. 
Identifying individual cells will be accomplished by using an SQL-like query, 
which consists of two components: 

1. The cells-part resembles the FROM clause in SQL. It defines alias names (cell 
identifiers) for the cube of the decision step. All cell identifiers belong to the 
same entity of the primary dimension level, but each cell identifier refers to 
a different cell for analysis. Syntactically, the cells-part is specified by FOR 
CELLS <cell-id 1>, <cell-id 2>, .... 

2. The selection-part represents the selection criteria for these cell identifiers. 
It is similar to the WHERE clause in SQL. A selection criterion is a boolean 
predicate, whose operands refer to the attributes of the level descriptions of 
a cell’s dimension levels or to atomic values. Valid operators are <, >, ≤, 
≥, =, and negations. Syntactically, the selection-part is specified by WHERE 
<selection criteria>. 

Example 11. As illustrated in Fig. 6, the decision step specified for cube 
SalesAustraliaThisQuarterMonths uses cell identifiers s1, s2, and s3 to 
identify the sales of the first, second, and third month of a quarter for an instance 
of the primary dimension level Product[Article]. 

If the conditions triggerActionCond or detailAnalysisCond need to refer 
to subtotals of measures of cells describing the same entity of the primary 
dimension level, an optional aggregation-part offers the possibility to aggregate 
these cells along several analysis dimensions. The aggregation-part is similar to 
the GROUP BY clause in SQL, except that it aggregates the cells describing the 
same primary dimension level implicitly. Thus, only aggregation along 
analysis dimensions must be specified explicitly using clause GROUPED BY <analysis 
dimension levels> and aggregation functions, such as SUM, COUNT, and AVG in 
the conditions. 

Example 12. The analysis conditions of the decision step specified for cube 
SalesAusCitiesThisQuarter refer to the measure of cells that were grouped by 
dimension attributes P and T. This measure represents the average SalesTotal 
of Australian cities of the current quarter. 



4.2 Semantics of Analysis Rules 

In decision making, an analysis rule examines for each entity e of the analysis 
scope, whether the rule’s specified OLTP method should be invoked in opera- 
tional systems or whether e should be forwarded to further manual analysis. 

To come to a decision, the rule’s decision steps are evaluated for e. Since an 
analyst may not have specified decision steps for every cube 
of the analysis graph, we introduce for each cube $c$ of the analysis graph’s cube set $V$ the 
two conditions $TA_c(e)$ and $DA_c(e)$. $TA_c(e)$ represents the decision to execute 
the rule’s action for e after analyzing $c$ and $DA_c(e)$ represents the decision to 
continue analysis at more detailed cubes for e after analyzing $c$. $TA_c(e)$ and 
$DA_c(e)$ are defined as follows: 

$$TA_c(e) = \begin{cases} triggerActionCond_c & \text{if } triggerActionCond_c \text{ is defined for } c \\ \text{FALSE} & \text{else} \end{cases}$$

$$DA_c(e) = \begin{cases} detailAnalysisCond_c & \text{if } detailAnalysisCond_c \text{ is defined for } c \\ \text{TRUE} & \text{else} \end{cases}$$

To mimic the top-down approach in analyzing cubes, we extend the above 
definitions and introduce $DA_c^+(e)$ as the extended detail analysis condition of 
cube $c$ and $TA_c^+(e)$ as the extended trigger action condition of cube $c$. $DA_c^+(e)$ 
is defined recursively and is satisfied if one of the direct more general cubes of 
$c$ decides to continue analysis at $c$ and if the detail analysis condition of $c$ is 
satisfied. $TA_c^+(e)$ is satisfied if the trigger action condition of $c$ is satisfied and 
at least one direct more general cube of $c$ decides to continue analysis at $c$. 

$$DA_c^+(e) = DA_c(e) \wedge \Big( \bigvee_{c_j \in super(c)} DA_{c_j}^+(e) \Big)$$

$$TA_c^+(e) = TA_c(e) \wedge \Big( \bigvee_{c_j \in super(c)} DA_{c_j}^+(e) \Big)$$

The semantics of an analysis rule is thus defined declaratively by the 
two boolean functions executeOLTPMethod(e) and analyzeManually(e), 
which determine for an entity e whether the analysis rule’s OLTP method 
should be executed or whether e has to be analyzed manually. Function 
executeOLTPMethod(e) evaluates to “TRUE” if analyzing the cells of at least 
one cube of the analysis graph results in the decision to trigger the rule’s action 
for e. Function analyzeManually(e) evaluates to “TRUE” if analyzing the cells 
of at least one cube of the analysis graph results in the decision to detail analysis 
and no other cube decides to trigger the rule’s action for e. 

1. $executeOLTPMethod(e) = \bigvee_{c \in V} TA_c^+(e)$ 

2. $analyzeManually(e) = \Big( \bigvee_{c \in V} DA_c^+(e) \Big) \wedge (executeOLTPMethod(e) = \text{“FALSE”})$ 
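
The declarative semantics can be traced with a short evaluator. This is a sketch under stated assumptions rather than the prototype's implementation: the function `decide` and the cube names are hypothetical, the truth values of the conditions for one entity e are passed in as dictionaries, and the root cube (which has no more general cube) is assumed to be always reached, a base case the formulas above leave implicit.

```python
def decide(cubes, super_of, ta, da):
    """Return (executeOLTPMethod(e), analyzeManually(e)) for one entity e.

    cubes    -- names of all cubes in V
    super_of -- cube name -> direct more general cubes (missing for the root)
    ta, da   -- cube name -> truth value of triggerActionCond / detailAnalysisCond
                for e; a missing entry means the condition is not defined
    """
    memo = {}

    def reached(c):
        parents = super_of.get(c, ())
        # Assumption: a cube without more general cubes (the root) is always analysed.
        return not parents or any(da_plus(p) for p in parents)

    def da_plus(c):
        if c not in memo:
            memo[c] = da.get(c, True) and reached(c)    # DA_c defaults to TRUE
        return memo[c]

    def ta_plus(c):
        return ta.get(c, False) and reached(c)          # TA_c defaults to FALSE

    execute = any(ta_plus(c) for c in cubes)
    manual = any(da_plus(c) for c in cubes) and not execute
    return execute, manual


# Hypothetical four-cube graph; city_months has two direct more general cubes.
V = ["root", "months", "cities", "city_months"]
SUPER = {"months": ["root"], "cities": ["root"],
         "city_months": ["months", "cities"]}

triggered = decide(V, SUPER, ta={"months": True}, da={"root": True})
not_reached = decide(V, SUPER, ta={"months": True}, da={"root": False})
undecided = decide(V, SUPER, ta={}, da={"root": True})
```

In the second call the trigger condition of `months` holds, but the cube is never reached because the root does not ask for detail analysis, so neither function is satisfied. In the third call analysis continues below the root without any cube triggering the action, so the entity is handed over to manual analysis.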



5 Conclusion and Future Work 

In this paper, we introduced a novel architecture for automatic analysis and de- 
cision making in data warehouses. We called this architecture active data ware- 
house. We further introduced analysis rules, which are templates for automatic 
decision making in active data warehouses. Analysis rules fire at the occurrence 
of a calendar event or at the occurrence of a relative temporal event, which may 
refer to method invocations in operational systems (i.e., OLTP method events). 
Automatic decision making of analysis rules is accomplished by using (1) the 
analysis graph, which is a hierarchy of cubes and (2) by specifying a set of deci- 
sion steps, which apply decision criteria on these cubes. An analysis rule’s action 
will be executed in an OLTP system for an entity if the rule considers the entity 
to be “decided”. We have built a proof-of-concept prototype on top of an existing 
relational database system (Oracle 8i), which will be described in a forthcoming 
paper. 
Abstract. Since data warehousing has become a major field of research there has 
been a lot of interest in the selection of materialized views for query optimization. 
The problem is to find the set of materialized views which yields the highest cost 
savings for a given set of queries under a certain space constraint. The analytical 
perspective results in queries which on the one hand require aggregations but on 
the other hand are quite restrictive with regard to the fact data. Usually there are 
“hot spots”, i.e. regions which are requested very frequently, like the current 
period or the most important product group. However, most algorithms in litera- 
ture do not consider restrictions of queries and therefore generate only views con- 
taining all summary data at a certain aggregation level although the space it occu- 
pies could better be used for other, more beneficial views. This article presents an 
algorithm for the selection of restricted views. The cost savings using this algo- 
rithm have been experimentally evaluated to be up to 80% by supplying only 5% 
additional space. 



1 Introduction 

A very popular technique to increase the performance of data warehouse queries is the 
precomputation of results in the form of materialized views. To gain maximum perfor- 
mance for a given set of queries the “best” set of views which fits into a supplied amount 
of space has to be determined. For a given query behaviour the view selection problem 
can be stated as follows: 

Definition 1: Let $\mathcal{Q} = \{Q_1, \ldots, Q_n\}$ be a set of queries with the respective access 
frequencies $\{f_1, \ldots, f_n\}$ and let $S$ be a supplied amount of space for materialized views. A 
solution to the view selection problem is a set of views $\mathcal{V} = \{V_1, \ldots, V_m\}$ with $\sum_j |V_j| \le S$ 
such that the total cost to answer all queries $C(\mathcal{Q}) = \sum_i f_i \, C(Q_i)$ is minimal.^1 
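
Definition 1 can be made concrete with an exhaustive search over small candidate sets. The sketch below is not the partitioning algorithm of this article. It assumes a simple cost model that the definition does not fix: the cost of a query is the size of the smallest selected view able to answer it, or the size of the fact table if no selected view qualifies. All names, sizes and frequencies are hypothetical.

```python
from itertools import combinations


def select_views(queries, views, base_size, space):
    """Exhaustively pick the view set with minimal total cost within `space`.

    queries -- query name -> access frequency f_i
    views   -- view name -> (size |V_j|, set of queries the view can answer)
    """
    def total_cost(chosen):
        cost = 0
        for q, f in queries.items():
            sizes = [views[v][0] for v in chosen if q in views[v][1]]
            cost += f * min(sizes, default=base_size)   # assumed cost model
        return cost

    best = ((), total_cost(()))
    for k in range(1, len(views) + 1):
        for chosen in combinations(sorted(views), k):
            if sum(views[v][0] for v in chosen) <= space:
                cost = total_cost(chosen)
                if cost < best[1]:
                    best = (chosen, cost)
    return best


queries = {"q_hot": 8, "q_rare": 2}
views = {
    "full": (200, {"q_hot", "q_rare"}),   # unrestricted aggregate
    "hot": (20, {"q_hot"}),               # restricted to the hot spot
}
tight = select_views(queries, views, base_size=1000, space=50)
roomy = select_views(queries, views, base_size=1000, space=220)
```

With only 50 units of space the small restricted view is the only candidate that fits and it already serves the frequent query. The search enumerates all subsets and is therefore exponential in the number of candidate views; it is meant to illustrate the objective, not to scale.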

Several preaggregation mechanisms exist for the view selection problem (see 
section 5). Most of them are based on an aggregation lattice ([HaRU96]). Informally an 
aggregation lattice is a DAG showing the dependencies among the sets of grouping 
attributes (granularities) in a multidimensional database. In the aggregation lattice in 
figure 1 the higher (coarser) nodes can be computed from the lower (finer) nodes. A 
more formal treatment of the relationship between the nodes is given in the next section. 



^1 The partitioning algorithm which is presented in this article aims at minimizing query cost. An 
extension to include update costs is straightforward, as illustrated for example in [BaPT97]. 
[Figure: aggregation lattice whose nodes combine the product levels (Article, Family, Group, Area) with the location levels (Shop, City, Region, Country); the finest node is (Article, Shop).] 

Fig. 1. An aggregation lattice: The direction of the edges is from lower to higher nodes. 



To illustrate the drawbacks of lattice-based algorithms as well as to simplify our exam- 
ples we will restrict the discussion to multidimensional databases in the sense of 
[BaPT97] represented by a star/snowflake schema ([Kimb96]). The general structure of 
an aggregate query on a star schema (star query) is 

    SELECT <grouping attributes>, AGG(<measure>) 
    FROM <fact table>, <dimension tables> 
    WHERE <join conditions> AND 
          <restrictions> 
    GROUP BY <grouping attributes> 

We call queries which have restrictions restricted. If there are only join conditions in the 
where clause, the queries are called unrestricted. Lattice-based algorithms like 
[HaRU96] and [BaPT97] treat all queries as unrestricted queries; restrictions are simply not 
considered. The queries are mapped to the nodes in the lattice which only reflect the 
grouping attributes. Based on the reference frequencies of each node it is decided which 
of the (unrestricted) nodes are to be materialized. Dealing only with grouping attributes 
allows a relatively simple view selection algorithm. 

But most of the data warehouse queries are focused on certain product groups, loca- 
tions, customer segments and especially time ranges. And there are hot spots with 
respect to these restrictions, too. Usually the current time period is most interesting. If 
for example 80% of the queries concern only the last month, why should one materialize 
an aggregate containing data for the last 5 years as well? Besides these obvious time 
restrictions there are very often “hot” product groups, customer segments and so on. 
Neglecting these hot spots results in two major drawbacks: First, space is wasted on one 
aggregation level which could be used for more interesting regions on other aggregation 
levels, and second, the unrestricted materialized views are much larger than restricted 
views. Therefore query runtime increases. 

If the query behaviour is known or can be estimated, a view selection algorithm should 
produce the “best approximation” of the queries under a given space constraint. In the 
next sections we present an algorithm which computes a set of restricted aggregates 
which is “tailored” for a certain set of queries. The basic idea is to select only useful 
partitions or fragments of each lattice node. Restrictions in data warehouse queries, like 
“video devices in Germany” or “Consumer Electronics in December 1999”, are based on 
the dimension hierarchies. Therefore, the split algorithm is also designed to partition the 
data space along these classification boundaries. 

The rest of the paper is organized as follows: The formal framework needed for our 
method is introduced in the next section. The actual algorithm follows in section 3. Sec- 
tion 4 shows some experimental results. Some related work is mentioned briefly in sec- 
tion 5. The paper concludes with a summary. 



2 Dimensions, Granularities and Queries 

This section introduces some notation which is necessary for the simple description of 
queries and their relationships. 

Dimensions. A fundamental aspect of a multidimensional database is that the items in 
a dimension are organized hierarchically. The structure of a hierarchy is defined by 
dimensional schemas with dimensional attributes like product groups, cities etc. 

Definition: Let → denote the functional dependency. A dimension schema is a partially 
ordered set of attributes, called dimension levels, $\mathcal{D} = [\{D_1, \ldots, D_n\}, \rightarrow]$ with a 
least element^2. 

Figure 2 shows two sample dimensional schemas for products and locations. Parallel 
hierarchies like the brand classification on the products are also possible. To distinguish 
the attributes from different dimensions we will use the notation $\mathcal{D}.D_j$, e.g. 
Product.Brand or simply P.Brand. 

Granularities. The partial orders on the dimensional schemas induce a partial order on 
the power set of all attributes from all dimensions. The set of grouping attributes of a 
star query, called granularity or aggregation level, is an element of this power set. More 
formally, a granularity is a set of attributes $G = \{g_1, \ldots, g_n\}$ where $g_i = \mathcal{D}.D_j$ in some 
dimension $\mathcal{D}$, with a partial order based on the functional dependencies. 

Definition: A granularity $G = \{g_1, \ldots, g_n\}$ is finer than or equal to a granularity 
$G' = \{g_1', \ldots, g_k'\}$, denoted as $G \le G'$, if and only if for each $g_j' \in G'$ there is a $g_i \in G$ such 
that $g_i \rightarrow g_j'$. 

This partial order is the basis for the aggregation lattice ([HaRU96], [BaPT97]). For 
example {P.Article, L.Shop} ≤ {P.Family, L.City} ≤ {L.Region} (figures 1 and 2). The 
partial order can also be used to define a greatest common ancestor of two granularities 
which is unique because of the least element in each dimension schema. 
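
The “finer than or equal” test follows directly from the definition. The sketch below assumes two simple chains of functional dependencies read off the levels in Fig. 1 (Article → Family → Group → Area and Shop → City → Region → Country); the order of the upper product levels and the names `PARENT`, `determines` and `finer_or_equal` are assumptions. Parallel hierarchies such as the brand classification would need a set of parents per level instead of a single one.

```python
# Direct functional dependencies child -> parent (assumed from Fig. 1).
PARENT = {
    "P.Article": "P.Family", "P.Family": "P.Group", "P.Group": "P.Area",
    "L.Shop": "L.City", "L.City": "L.Region", "L.Region": "L.Country",
}


def determines(a, b):
    """True if a -> b holds, using the reflexive-transitive closure of PARENT."""
    while a is not None:
        if a == b:
            return True
        a = PARENT.get(a)
    return False


def finer_or_equal(g, g_prime):
    """G <= G': every attribute of G' is determined by some attribute of G."""
    return all(any(determines(a, b) for a in g) for b in g_prime)


step1 = finer_or_equal({"P.Article", "L.Shop"}, {"P.Family", "L.City"})
step2 = finer_or_equal({"P.Family", "L.City"}, {"L.Region"})
reverse = finer_or_equal({"L.Region"}, {"P.Family", "L.City"})
```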



Fig. 2. Dimension schemas (Product and Location). 

^2 The functional dependency only induces a partial order if there are no attributes A → B and B → A with 
A ≠ B, which is assumed in the following. 
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Definition: The greatest common ancestor of two granularities G-| and G 2 , denoted as 
G-|® G 2 is the granularity G such that 

• G<G-| and G<G 2 and 

• there is no G’ with G’<Gi and G’<G 2 and G<G’. 

Queries. If a single measure with an additive aggregation function ([AlGL99]) and lossless joins are assumed, star queries can simply be characterized by the granularity and the restrictions denoting the scope of interest.

Definition: A query Q consists of a granularity G and a scope S, i.e. Q=[G, S]. 

In our examples we assume the scope to be a conjunction of restrictions of the kind $\mathcal{D}.D_j$ = <value>, for example Product.Group = “Video”. Note that this is not a prerequisite of the algorithm. If an efficient algorithm is used to check containment for more complex predicates, e.g. [AlGL99], more general restrictions could be used as well.



3 The Selection of Aggregate Partitions 

In preparation for the partition selection algorithm, the requested scopes of the queries must be reflected in the aggregation lattice. Therefore, the node definition must be extended.

3.1 Building the Lattice 

Because lattice based algorithms face a scalability problem ([BaPT97]), we will not 
introduce more nodes, but just attach the requested scopes and their respective reference 
frequencies to the node. Thus a node can still be uniquely identified by its granularity. 

Definition: A lattice node is defined as a structure $N = [G, \mathcal{S}, \mathcal{R}]$ where $G$ is its granularity, $\mathcal{S} = \{S_1, \ldots, S_n\}$ is a set of scopes and $\mathcal{R} = \{R_{S_1}, \ldots, R_{S_n}\}$ are the reference frequencies for the scopes.

To build the reduced lattice for a set of reference queries Q, the algorithm described in [BaPT97] can be used with small changes. The algorithm is illustrated for nodes without reference frequencies; these are added in the next step (section 3.2). First, for each granularity $G_i$ appearing in a query, a node containing all referenced scopes must be created. If there is already such a node, only the scope of the query is added to the scope set of the node. Otherwise the respective node is created. In the second part of the algorithm, for any two nodes the greatest common ancestor with the superset of the scopes is added to the lattice.

3.2 Propagation of Reference Frequencies 

The algorithm for the partitioned view materialization faces a hen and egg problem. On the one hand, the nodes must first be partitioned in order to run the selection algorithm. On the other hand, the partitioning algorithm must know, for the computation of the partitioning scheme, which views will be materialized in the upper nodes. Consider the nodes A, B and C in figure 3a. B and C represent two queries requesting the scopes $S_B$ and $S_C$ respectively. Node $A = [G_B \oplus G_C, \{S_B, S_C\}]$ is the greatest common ancestor of B and C. Because B and C are the only queries, useful partitionings of A might be $\{S_B, S_C \setminus S_B\}$, $\{S_B \setminus S_C, S_C\}$ or $\{S_B \setminus S_C, S_C \setminus S_B, S_C \cap S_B\}$.



However, if for example B was materialized, A will never be used to compute B. Thus the reference frequency of $S_B \setminus S_C$ in this case will be zero and therefore only the materialization of the partition for $S_C$ would make sense. But the reference frequencies must be determined before the view selection algorithm can be run, because the goal is to materialize those partitions which are referenced most frequently!

Fig. 3. Propagation of reference frequencies. a) Hen and egg problem: How should A be partitioned? b) Propagation of 30 references to a single scope in node A (direct and indirect references).



Thus we use a heuristic propagation algorithm which distributes the actual access frequencies of a node (direct references) to its ancestors (indirect references) to estimate the possible references after the view selection. To illustrate this idea consider figure 3b, where a certain scope S was referenced 30 times in node N. These thirty references have to be propagated to the other nodes, which in the example have zero direct references for S. The underlying assumption for the propagation algorithm is that the smaller the cardinality (expected number of tuples) of an ancestor node, the more likely this node will be materialized. Let $N_1, \ldots, N_n$ be the ancestors of the node N with the direct references $R_{N,S}$ to scope S. Then the number of indirect references $R_{N_k,S}$ is computed from $R_{N,S}$ for each ancestor $N_k$ [the formula and the worked example for node $N_2$ are illegible in the source]. This number is added to the number of direct references of S in each ancestor $N_k$, yielding the total number of references which is used for the partitioning algorithm.
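As an illustration only, the following Python sketch distributes the direct references of a node to its ancestors with weights inversely proportional to the ancestors' cardinalities. This weighting is an assumption chosen to match the stated heuristic (smaller ancestors are more likely to be materialized); it is not claimed to be the formula used in the paper, and the function name `propagate` and the cardinalities are hypothetical.

```python
def propagate(direct_refs, ancestor_cardinalities):
    """Split direct_refs among ancestors, favouring small ancestors.

    ASSUMPTION: each ancestor receives a share proportional to 1/cardinality.
    ancestor_cardinalities maps an ancestor name to its (positive) expected
    number of tuples.
    """
    weights = {name: 1.0 / card for name, card in ancestor_cardinalities.items()}
    total = sum(weights.values())
    return {name: direct_refs * w / total for name, w in weights.items()}


if __name__ == "__main__":
    # 30 direct references to one scope, three hypothetical ancestors.
    shares = propagate(30, {"N1": 1000, "N2": 500, "N3": 250})
    for name, refs in shares.items():
        print(name, round(refs, 2))
```

With this choice the indirect references always sum to the direct references of the originating node, and the smallest ancestor receives the largest share.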



3.3 Partitioning the Lattice Nodes 

In this step each node of the reduced lattice will be recursively split up into partitions. The optimal size of the partitions depends on the reference frequencies of the scopes in this node. However, accessing many partitions implies some overhead for their unification, and this is why it makes sense to define a cost function. This offers the ability to compare the cost of accessing several fragments with the access cost of a single table. The idea of the split algorithm is then to split the fact table in each node as long as the overall access cost of all queries to the fragments gets cheaper.

Cost function

Assume that the node $N = [G, \{S_1, \ldots, S_n\}, \{R_{S_1}, \ldots, R_{S_n}\}]$ was already partitioned into the scope partitions $\mathcal{P} = \{P_1, \ldots, P_q\}$ (footnote 3). Then the cost $C(S_i, \mathcal{P})$ to read scope $S_i$ from $P_1, \ldots, P_q$ consists of two parts: scanning costs and unification costs.

Actually, the formulas for these costs do not matter as long as they are monotonic. In correspondence to [HaRU96] we assumed that the scanning cost for reading a partition grows linearly with its size. The unification cost grows linearly with the intersection of S and the respective partitions because these tuples have to be written to disk temporarily. Assuming that S overlaps partitions $P_j, \ldots, P_k$ we used the following estimate for the access cost:

$$C(S, \mathcal{P}) = \sum_{i=j}^{k} \left[ (a_s \cdot |P_i| + b_s) + (a_u \cdot |P_i \cap S| + b_u) \right]$$

The constants $a_s$ and $a_u$ (scan and update) for the time to scan one tuple and $b_s$ and $b_u$ for the overhead to open and close the table are system dependent.

The total cost C of accessing all scopes $\{S_1, \ldots, S_n\}$ in node N is the sum of the costs for each scope weighted by the reference frequencies:

$$C(N, \mathcal{P}) = \sum_{i=1}^{n} R_{S_i} \cdot C(S_i, \mathcal{P})$$
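The two cost formulas can be sketched in Python as follows. The representation is an assumption made for illustration: each scope is described by a list of (partition size, overlap size) pairs for the partitions it overlaps, and the parameter values in the example are invented, not the measured system constants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    a_s: float  # scan time per tuple
    b_s: float  # overhead to open and close a table for scanning
    a_u: float  # unification (temporary write) time per tuple
    b_u: float  # overhead per partition for unification


def scope_cost(overlaps, p):
    """C(S, P): overlaps lists (|P_i|, |P_i ∩ S|) for every overlapped partition."""
    return sum((p.a_s * size + p.b_s) + (p.a_u * inter + p.b_u)
               for size, inter in overlaps)


def node_cost(scopes, p):
    """C(N, P): scopes lists (reference frequency, overlaps) per scope."""
    return sum(freq * scope_cost(overlaps, p) for freq, overlaps in scopes)


if __name__ == "__main__":
    params = CostParams(a_s=1.0, b_s=10.0, a_u=2.0, b_u=5.0)  # hypothetical
    s1 = [(100, 40), (50, 50)]   # scope overlapping two partitions
    s2 = [(100, 100)]            # scope answered from one partition
    print(scope_cost(s1, params), node_cost([(3, s1), (1, s2)], params))
```

Because both cost terms are linear and monotonic in the partition and overlap sizes, any other monotonic estimate could be substituted in `scope_cost` without changing the structure of the computation.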

Partitioning Algorithm

Let $N = [G, \{S_1, \ldots, S_n\}, \{R_{S_1}, \ldots, R_{S_n}\}]$ be the node to be partitioned. The recursive partitioning algorithm is shown in figure 4. It is initialized with the call

Partition(N, $\emptyset$, TopLevels(G)),

i.e. it starts with a single partition representing the whole node without any restrictions. This partition is then recursively split into smaller pieces by using the semantic grid structure given by the classification hierarchies. The split process starts at the highest dimension levels. This is expressed by TopLevels(G). For example (figure 2), TopLevels({P.Article, S.Shop}) = {P.Area, P.Brand, S.Country} and TopLevels({P.Family}) = {P.Area}.

Algorithm: Partition(N, P, $\mathcal{L}$)

Input: node $N = [G, \{S_1, \ldots, S_n\}, \{R_1, \ldots, R_n\}]$
       the partition P to be split
       set of hierarchy levels $\mathcal{L} = \{L_1, \ldots, L_l\}$

Output: partitioning $\mathcal{P}'$

 1  Begin
 2    $C_{min} := \infty$
 3
 4    // try iteratively to split in each dimension
 5    Foreach level $L_i \in \mathcal{L}$
 6      $\mathcal{P}' := split(P, L_i)$
 7
 8      // determine partitioning with least cost
 9      If ($C(N, \mathcal{P}') < C_{min}$) Then
10        $C_{min} := C(N, \mathcal{P}')$
11        $\mathcal{P}_{min} := \mathcal{P}'$
12        $p := i$
13      End If
14    End Foreach
15    // if partitioning is cheaper
16    If ($C_{min} < \infty$) Then
17      // adjust split levels
18      $\mathcal{L} := \mathcal{L} \setminus \{L_p\} \cup \{L_p - 1\}$
19
20      // run partitioning algorithm recursively
21      $\mathcal{P}' := \emptyset$
22      Foreach $P \in \mathcal{P}_{min}$
23        $\mathcal{P}' := \mathcal{P}' \cup Partition(N, P, \mathcal{L})$
24      End Foreach
25    End If
26
27    Return $\mathcal{P}'$
28  End

Fig. 4. Partitioning algorithm.

3 $\mathcal{P}$ is a partitioning of the whole data space, i.e. $P_i \cap P_j = \emptyset$ for $i \ne j$ and $\bigcup_i P_i$ yields the whole space.

To explain the algorithm we will use the example in figure 5 with the node at granularity {P.Article, S.City}, starting with $\mathcal{L}$ = {P.Area, S.Country} as split levels (brand is omitted for simplicity). The illustrated node has possible requests to the scopes $S_1$, $S_2$, $S_3$ and $S_4$, which have conjunctive restrictions based on the classification hierarchies, e.g. $S_4$ = (P.Family = “HomeVCR” AND S.Region = “North”).

In the first two steps the node is split at P.Area and the left partition again at S.Country to separate those partitions which are never requested (striped). We use the remaining part, i.e. the partition P = {(P.Area = “Consumer Electronics” AND S.Country = “Germany”)}, to explain the algorithm. The split level vector is now $\mathcal{L}$ = {P.Group, S.Region}. Thus, the costs $C(N, \mathcal{P}')$ for the following two partitionings $\mathcal{P}'$ are computed (line 6):

• $\mathcal{P}_1$ = {(“Audio” AND “Germany”), (“Video” AND “Germany”)}

• $\mathcal{P}_2$ = {(“Consumer Electronics” AND “North”), (“Consumer Electronics” AND “South”)}

Using $\mathcal{P}_1$, scopes $S_1$ and $S_3$ require a scan of both partitions. With $\mathcal{P}_2$ this is only the case for $S_1$. Here $S_2$, $S_3$ and $S_4$ can each be computed from a single partition which is, according to the partitioning cost computation, cheaper. Therefore the second scheme is applied. After that, the set of split levels is adjusted to $\mathcal{L}$ = {P.Group, S.City} (line 18). The algorithm now recursively continues for each partition in $\mathcal{P}_2$. It turns out that it is not useful to split $P_1$ again. $P_2$, however, is split again into $P_3$ and $P_4$ (figure 5).

3.4 The Selection of Aggregate Partitions 

A preliminary adjustment is necessary before the actual greedy algorithm can work. Consider again scope $S_1$ in figure 5. If only $P_1$ was materialized, the raw data have to be accessed for the part covered by $P_3$. If a scan of the raw data is necessary anyway, then $S_1$ should better be retrieved completely from the raw data. But in this case $P_1$ would not contribute anything to the computation of $S_1$ and the benefit is zero. So $P_1$ only yields a benefit for $S_1$ if $P_3$ was also materialized! To reflect this, for each scope (each query) in every node a bucket containing all partitions which are needed to compute this scope is defined. The bucket for $S_1$ is for example $\{P_1, P_3\}$.

Fig. 6. Experimental results (a: partitioned, b: unpartitioned).

For the selection of views we now use the greedy algorithm presented in [HaRU96]. In our case, however, the objects to materialize are not unrestricted queries, but partitions in a bucket. The benefit is computed on buckets, but if a bucket is selected each partition is materialized separately. The cost model is very similar to the one defined in [HaRU96]. The benefit of a bucket is calculated as the difference of tuples to be scanned to answer all queries with and without the partitions in the bucket materialized.

The algorithm chooses the unmaterialized bucket with the highest benefit based on the 
already selected buckets. This step is repeated until the available space for preaggre- 
gates is exhausted. 
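The bucket-based greedy step can be sketched in Python under simplifying assumptions. Here the benefit of a bucket is approximated as reference frequency times the tuples saved compared with a raw-data scan, and it is not re-evaluated against previously selected buckets as a full implementation in the spirit of [HaRU96] would do. The function name, the data layout and all numbers are hypothetical.

```python
def greedy_select(buckets, partition_sizes, space_limit):
    """Pick buckets by highest benefit until no further bucket fits.

    buckets: name -> (set of partition names, reference frequency, raw scan cost)
    partition_sizes: partition name -> number of tuples
    Returns the chosen bucket names (in order) and the materialized partitions.
    """
    materialized, chosen, used = set(), [], 0
    remaining = dict(buckets)
    while remaining:
        best_name, best_benefit, best_extra = None, 0, 0
        for name, (parts, freq, raw_cost) in remaining.items():
            # Space is only charged for partitions not yet materialized.
            extra = sum(partition_sizes[p] for p in parts - materialized)
            if used + extra > space_limit:
                continue
            # Simplified benefit: tuples saved per reference times frequency.
            benefit = freq * (raw_cost - sum(partition_sizes[p] for p in parts))
            if benefit > best_benefit:
                best_name, best_benefit, best_extra = name, benefit, extra
        if best_name is None:
            break
        parts, _, _ = remaining.pop(best_name)
        materialized |= parts
        used += best_extra
        chosen.append(best_name)
    return chosen, materialized


if __name__ == "__main__":
    sizes = {"P1": 100, "P2": 200, "P3": 50, "P4": 400}
    buckets = {
        "S1": ({"P1", "P3"}, 10, 1000),
        "S2": ({"P2"}, 5, 1000),
        "S3": ({"P4"}, 20, 1000),
    }
    print(greedy_select(buckets, sizes, space_limit=600))
```

Selecting whole buckets rather than single partitions reflects the observation above that a partition only pays off for a scope if all other partitions needed by that scope are materialized as well.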



4 Experimental Results 

Because the algorithm defined in the previous section works on aggregate partitions, the 
number of blocks to be read in order to answer a given set of queries should drastically 
decrease. In this part we will demonstrate the efficiency of our method and compare it 
to the results of the algorithm of [BaPT97]. 

We used the scenario of the APB-1 benchmark of the OLAP Council [OLAP98] in a slightly modified version. The model has four dimensions and five measures. The fact table contained about 2.7 million tuples at a sparsity of 0.1%. The measurements were performed on Oracle 8i on a Sun 4000. The system parameters a and b for the partitioning cost function (section 3.3) were determined by several tests in advance.

In the test the input to the algorithms was 100 queries independently requesting arbitrary data sets. Therefore, we do not have a hot spot, but several hot regions. As illustrated in figure 6a, the number of blocks read drops by more than 80%. However, in this case approximately 50,000 tuples (still only about 2% of the fact table!) were supplied for materialized aggregates. The borders of the buckets can be clearly identified by the steps in the figure. If not all partitions of a bucket are materialized, no further benefit can be achieved. If it is completely materialized, the number of blocks and the time needed to answer the queries drop instantly.

To compare our algorithm with the unpartitioned lattice-based method of [BaPT97], we used the same set of queries and disabled the partitioning. The result is shown in figure 6b. As can be seen, about seven million tuples must be materialized to reduce the number of blocks read by 40%.

We did not use indexes or partitioning on the fact table in the experiments. Otherwise the performance gain would not be so dramatic. But still the results would be much better than with former lattice-based algorithms.



5 Related Work 

In the last few years there have been several influential articles on the view selection 
problem. Of fundamental importance for this article was the work on aggregation lat- 
tices of [HaRU96] and [BaPT97]. [GHRU97] additionally considers existing index 
structures. [Gupt97] generalizes the aggregation lattice in the form of AND/OR-graphs 
by considering alternative local query execution plans. An efficient simplification of 
[HaRU96] is the method of [ShDN98]. The method also proposes a partition scheme for 
the lattice nodes. However, all partitions of the selected nodes in the lattice are com- 
pletely materialized. Moreover, the partitioning scheme is physical (in terms of disk 
pages) and not logical which is more useful in the presence of restriction predicates. 

Similar to our algorithm is the one presented in [EzBa98]. Here the predicates in the restrictions of queries are resolved into minterms. However, the number of minterms grows exponentially with the number of queries. A drawback of the algorithm is that the minterms have no relationship to the dimension hierarchies. Therefore, the partitioning scheme will have a relatively arbitrary structure which might even be disadvantageous for queries not in the reference set. The algorithm of [YaKL97] works on relational algebra expressions. It computes a multiple view processing plan for a set of queries and selects subqueries with the highest number of references for precomputation.

The selection of views is not the only problem. Once views are materialized it has to be determined which views are the best to be used to answer a certain query. Methods that address this view usage problem are presented, for example, in [SDJL96], [CoNS99] and [AlGL99].

6 Summary 

The article presents a new algorithm for the selection of materialized aggregates. By taking into account the restrictions of the queries to be optimized, it is possible to come up with a much better approximation of the queries than in previous approaches. Therefore, the algorithm computes, with the same space limit, more but smaller aggregates. No space is wasted, because regions which are never queried are not materialized. The performance gains are up to 90% in terms of the number of disk blocks.

The partitioning algorithm can also be used to partition the raw data in the fact table without the goal to materialize views, because what the algorithm actually does is find the optimal partitioning of an arbitrary table for a given set of queries. This table does not necessarily have to represent an aggregated view.

Abstract. Index selection is one of the most important decisions in designing a data warehouse (DW). In this paper, we present a framework for a class of graph join indices used for indexing queries defined on materialized views. We derive the storage cost needed for these indices, and query processing strategies using them. We formulate the graph join index selection problem, and present algorithms which can provide good query performance under limited storage space for the indices. We also evaluate these algorithms to show their utility by using an example taken from an Informix white paper.



1 Introduction 

In recent years, there has been an explosive growth in the use of databases for decision support systems, which aim to provide answers to complex queries posed over very large databases. Data warehousing is a “collection of support technologies aimed at enabling the knowledge worker to make better and faster decisions” [2]. DWs are designed for decision support purposes and contain large amounts of historical data. For this reason, DWs tend to be extremely large [2]. OLAP queries (joins and aggregations) are very expensive to compute and can take many hours or even days to run [6]. This is because queries are performed on tables having potentially millions of records. The join operation of the relational database model is the fundamental operation that allows information from different relations to be combined. Joins are typically expensive operations, particularly when the relations involved are substantially larger than main memory [8, 15]. The star join index (SJI) [13] has been used in DWs modeled by a star schema. It denotes a join index between a fact table and multiple dimension tables. The SJI has been proved to be an efficient access structure for speeding up joins defined on fact and dimension tables in DWs [13, 9].

Another technique that can be used to speed up query processing is materialized views [5, 14, 16]. The problem of selecting a set of materialized views for a given set of OLAP queries is called the view selection problem. It starts with a DW schema with dimension tables and fact table(s), and then selects a set of materialized views to speed up a set of OLAP queries under some resource constraint (like disk space, computation time, or maintenance cost).

Y. Kambayashi, M. Mohania, and A M. Tjoa (Eds.): DaWaK 2000, LNCS 1874, pp. 57-66, 2000. 

© Springer-Verlag Berlin Heidelberg 2000 




58 L. Bellatreche, K. Karlapalem and Q. Li 



Once materialized views are selected, all queries will be rewritten using these materialized views (this process is known as query rewriting [12]). A rewriting of a query Q using views is a query expression Q' that refers to these views. In SQL, after the rewriting of OLAP queries, we can find in the FROM clause a set of materialized views, dimension tables and the fact table. These views can be joined with each other or with other tables.

The problem of selecting a set of indices to speed up a given set of queries is known as the index selection problem. It starts with a DW schema, and then builds indexes on top of this schema to speed up a set of OLAP queries under some resource constraint. Among the selected indices, we can find indices defined on top of a single table or view, and join indices defined on a fact table and dimension tables (e.g., SJI). Indexing a single materialized view can be done by using the same indexing techniques as for a table (e.g., B+-tree). However, there has not been much research reported on enhancing SJI algorithms for efficiently selecting and performing join indices with materialized views. In this paper, we consider the problem of join index selection in the presence of materialized views (PJISV).

1.1 Contributions and Organization of the Paper 

To the best of our knowledge, this is the first paper studying the concept of PJISV in DW environments. The main contributions of this paper are that we:

- present a novel indexing technique called graph join index to handle the PJISV. This index can be constructed using dimension tables, materialized views and a fact table. This is discussed in Section 2.

- formulate the PJISV and present algorithms for selecting optimal or near-optimal indices for a given set of OLAP queries. This is discussed in Section 3.

- show the utility of PJISV through evaluation. This is discussed in Section 4.



1.2 Related Work 

The problem of indexing materialized views was first studied by Roussopoulos [10]. A view index is similar to a materialized view except that instead of storing tuples in the view directly, each tuple in the view index consists of pointers to the tuples in the base relations that derive the view tuple. But the author did not consider the problem of join indexing materialized views. Choosing indices for materialized views is a straightforward extension of the techniques used for indexing tables.

Segev et al. [11] presented a join index called join pattern index in a rule environment where changes in the underlying data are propagated immediately to the view. This index exists on all attributes involved in selection and join conditions. It refers to an array of variables that represent the key and join attributes of the data associated with the join. However, in DWs, a large number of changes are propagated at once, and the cost of maintaining the indexes often outweighs any benefit obtained by using join indices. So it is not correct to assume that indices exist on all attributes involved in selection and join conditions [7].

1.3 Motivating Example 

We present an example of a DW schema that models the activities of a world-wide wholesale company, which will be used as a running example throughout this paper. This schema consists of three dimension tables CUSTOMER, PRODUCT, and TIME, and one fact table SALES. The tables and attributes of the schema are shown in Figure 1.



Fig. 1. An Example of a Star Schema 

Assume that we are interested in determining the total sales for male customers purchasing products with package type “Box”. The following SQL query $Q_1$ may be used for this purpose:

SELECT SUM(S.dollar_sales)
FROM CUSTOMER C, PRODUCT P, SALES S
WHERE C.Cid = S.Cid
AND P.Pid = S.Pid
AND C.Gender = 'M'
AND P.Package_type = 'Box'
GROUP BY P.Pid

This company also maintains the following materialized view, which finds the total sales for each product having “Box” as type of package. This view is defined as follows:

CREATE VIEW V_2 AS
SELECT *
FROM PRODUCT P, SALES S
WHERE P.Pid = S.Pid
AND P.Package_Type = 'Box'
GROUP BY P.Pid

The view $V_2$ can be used to evaluate the query $Q_1$ by joining $V_2$ with the dimension table CUSTOMER. The rewritten query $Q_1'$ that uses $V_2$ is:

SELECT SUM(V_2.dollar_sales)
FROM CUSTOMER C, V_2
WHERE V_2.Cid = C.Cid
AND C.Gender = 'M'
GROUP BY V_2.Pid

Note that the fact table SALES is huge, and the materialized view $V_2$ is likely to be orders of magnitude smaller than the SALES table. Hence, evaluating $Q_1'$ will be much faster than evaluating the query $Q_1$. This is because $Q_1$ needs two join operations, but $Q_1'$ needs only one. Another way to reduce the number of join operations of the query $Q_1$ is the SJI [13]. The suitable SJI for $Q_1$ is (CUSTOMER, SALES, PRODUCT), but it requires a very high storage capacity [4] that can slow down the execution of the query $Q_1$. A join index [15] between the view $V_2$ and the dimension table CUSTOMER can speed up the execution of the query $Q_1$, and its size is much smaller than the SJI (CUSTOMER, SALES, PRODUCT).

2 Notations & Definitions 

Definition 1. (Database Schema Graph (DBSG)) A DBSG is a connected graph where each node is a relation and an edge between two nodes $R_i$ and $R_j$ implies a possibility of join between relations $R_i$ and $R_j$.

Definition 2. (Join Graph (JG)) A JG is a database schema graph such that there exists exactly one join condition between two relations. In DW environments, a JG is the same as a DBSG.

Claim 1 A JG in a DW with only dimension tables and a fact table is always connected. In this case a JG is called total join graph.

Claim 2 In a DW context with dimension tables, a fact table and materialized views, a JG can be disconnected, i.e., it contains more than one subgraph. In this case it is called partitioned join graph.

The partitioned join graph occurs when: (1) views do not have common join attributes, or (2) views have common join attributes, but they can never be joined.

Definition 3. (Graph Join Index (GJI)) A GJI is a subset of nodes of a JG.

In case of a distributed join graph, a GJI is a subset of nodes of each sub-graph.

A DW schema consists of a star schema (a set of dimension tables and a fact table) with a set of materialized views. Each table $T_i$, being either a dimension table, a view, or a fact table, has a key $K_{T_i}$. A GJI on m tables is a table having m attributes $\{K_{T_1}, K_{T_2}, \ldots, K_{T_m}\}$, and is denoted as:

$(T_1 \sim T_2 \sim \ldots \sim T_m)$.

Since the star schema is connected, there is a path connecting the tables $(T_1, T_2, \ldots, T_m)$ by including additional tables, if necessary. The GJI supplies the keys of the relations $(T_1, T_2, \ldots, T_m)$ that form the result of joining all the tables along the path. In case there are multiple paths that connect tables $(T_1, T_2, \ldots, T_m)$, then there can be multiple GJIs, each catering to a different query. A GJI with all tables of a JG is known as a complete GJI. A GJI with a subset of nodes of a JG is referred to as a partial GJI (see Figure 2).




Fig. 2. Join Graph and Graph Join Indices 



Note that for a star schema with n tables, the number of possible GJIs grows exponentially, and is given by:

$$\binom{n}{2} + \binom{n}{3} + \cdots + \binom{n}{n} = 2^n - n - 1$$




3 GJI-Selection (GJI-S) Schemes 

In this section, we formulate the GJI-S problem and we present GJI-S algorithms. The GJI-S problem can be viewed as a search problem where the search space consists of all possible indices for a given set of queries. Since the search space is very large ($2^{2^n - n - 1}$ for n tables), we need algorithms to find the optimal or the “near optimal” GJIs for processing a set of queries.
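The number of candidate GJIs can be checked by brute-force enumeration. The sketch below assumes that every subset of at least two nodes of the join graph counts as one candidate GJI; the function name `enumerate_gjis` and the node labels are illustrative.

```python
from itertools import combinations


def enumerate_gjis(nodes):
    """Return every subset of at least two nodes as a candidate GJI."""
    return [combo
            for size in range(2, len(nodes) + 1)
            for combo in combinations(nodes, size)]


if __name__ == "__main__":
    candidates = enumerate_gjis(["V1", "V2", "T", "C"])
    print(len(candidates))  # 2**4 - 4 - 1 candidates for four nodes
    for gji in candidates:
        print(" ~ ".join(gji))
```

For four nodes this yields eleven candidates, and choosing any subset of these candidates as the index configuration is what makes the search space doubly exponential in the number of tables.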

3.1 GJI-S Problem Formulation 

The inputs of our problem are as follows:

(i) a star schema with the fact table F and dimension tables $\{D_1, D_2, \ldots, D_d\}$,

(ii) a set of most frequently asked queries $Q = \{Q_1, Q_2, \ldots, Q_s\}$ with their frequencies $\{f_1, f_2, \ldots, f_s\}$,

(iii) a set of materialized views $V = \{V_1, V_2, \ldots, V_m\}$ selected to support the execution of queries in Q, and

(iv) a storage constraint S.

The GJI-S problem selects a set of GJIs G' which minimizes the total cost for executing a set of OLAP queries, under the constraint that the total space occupied by GJIs in G' is less than S.

More formally, let $C_g(Q_i)$ denote the cost of answering a query $Q_i$ using a GJI $g$ ($g \in G'$). The problem is to select a set of GJIs $G'$ ($G' \subseteq G$) (footnote 3) that minimizes the total query processing cost (TQPC) under the constraint that $S_{G'} < S$, where:

$$TQPC = \sum_{i=1}^{s} f_i \times \min_{g \in G'} C_g(Q_i)$$

3 G represents all possible GJIs for a star schema.
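A small Python sketch of the objective function follows. The cost and storage figures are invented for illustration, and the names `tqpc` and `within_storage` are hypothetical helpers rather than part of the paper's cost model, which is given in [1].

```python
def tqpc(frequencies, costs, selected):
    """Total query processing cost for the selected GJIs.

    frequencies: query name -> access frequency f_i
    costs: GJI name -> {query name -> cost C_g(Q_i)}
    selected: non-empty collection of GJI names (the set G')
    """
    return sum(freq * min(costs[g][query] for g in selected)
               for query, freq in frequencies.items())


def within_storage(storage, selected, limit):
    """Check the storage constraint: total size of G' must stay below the limit."""
    return sum(storage[g] for g in selected) < limit


if __name__ == "__main__":
    freqs = {"Q1": 2, "Q2": 5}                      # hypothetical frequencies
    costs = {"g1": {"Q1": 10, "Q2": 100},           # hypothetical page accesses
             "g2": {"Q1": 50, "Q2": 20}}
    storage = {"g1": 300, "g2": 400}                # hypothetical sizes
    print(tqpc(freqs, costs, ["g1"]), tqpc(freqs, costs, ["g1", "g2"]))
    print(within_storage(storage, ["g1", "g2"], limit=1000))
```

Each query is charged the cost of the cheapest selected index, so adding an index to G' can never increase the TQPC; only the storage constraint limits the selection.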

3.2 Graph Join Index Selection Algorithms 

All index selection algorithms select indices to optimize the objective function employed, subject to the constraint that the indices included cannot consume more than the specified amount of storage. Our objective function is based on two cost models that estimate the number of disk accesses (I/Os) to answer a set of queries in the presence of GJIs with and without materialized views. In order to elaborate these cost models, we make the following assumptions: (1) dimension tables, the fact table, materialized views and indices are stored on the disk (our cost model can easily be adapted when dimension tables, the fact table, materialized views or indices are in the main memory), and (2) selection conditions are always pushed down onto the dimension tables and the fact table as in [7].

Due to lack of space, we omit the description of our cost models. For details, see 
[1] in which we have also estimated the storage cost needed for the GJI structure. 

Naive Algorithm (NA) is used for comparison purposes. The NA enumerates all possible GJIs, and calculates the contribution of each GJI. Finally, it picks the GJI having the minimum cost while satisfying the storage constraint.



Greedy Algorithm for Selecting one GJI. We have developed a heuristic algorithm (GA1) for selecting only one GJI. Given a storage constraint S, our GA1 constructs a weighted GJI graph from the GJI graph by adding a weight to each edge which is equal to the sum of the frequencies of the queries using that edge. Note that each edge represents a join index between its two nodes. The GA1 starts with the simplest join index that corresponds to the most frequently executed join among all the queries, and tries to expand it by adding more nodes while checking if the total processing cost decreases. Once no more expand operations can be applied, the algorithm tries to shrink it to check if a better GJI can be derived. The GA1 keeps repeating these two operations (expand and shrink) till no more reduction in query processing cost can be achieved without violating the storage constraint.
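The first step of the heuristic, building the weighted graph and picking the starting join index, can be sketched as follows. The query workload below is hypothetical, and representing each query by an explicit list of join edges is an assumption made for illustration; the expand and shrink phases are not shown because they depend on the cost model of [1].

```python
from collections import Counter


def edge_weights(queries):
    """Sum query frequencies per join edge.

    queries: name -> (list of join edges, frequency); an edge is a pair of
    node names and is treated as undirected.
    """
    weights = Counter()
    for edges, freq in queries.values():
        for a, b in edges:
            weights[tuple(sorted((a, b)))] += freq
    return weights


def starting_edge(weights):
    """Return the most frequently used join edge."""
    return max(weights, key=weights.get)


if __name__ == "__main__":
    workload = {  # hypothetical queries, edges and frequencies
        "Qa": ([("V1", "T")], 10),
        "Qb": ([("V2", "T")], 5),
        "Qc": ([("V2", "T"), ("V2", "C")], 5),
        "Qd": ([("V2", "C")], 15),
    }
    w = edge_weights(workload)
    print(dict(w), starting_edge(w))
```

Starting from the heaviest edge means that the initial two-node index already serves the join that the workload executes most often, which gives the expand and shrink phases a reasonable first candidate.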



Greedy Algorithm for Selecting more than one GJI. Note that as a single GJI generated by GA1 can only efficiently execute some but not all of the queries, we need more than one GJI to efficiently execute all the queries. Therefore, we have developed an algorithm GAK that, within the storage constraint, finds at most $2^n - n - 1$ GJIs which can reduce the query processing cost as much as possible. The GAK algorithm starts with the initial solution provided by GA1 as the first index. After that it selects the edge from the weighted graph that has the highest query access frequency and which is not a GJI generated by GA1. Then it tries to expand (one or more times) and shrink (one or more times) till no further reduction in total cost can be achieved and the storage constraint is satisfied. This generates the second GJI, and GAK keeps repeating this procedure until at most $2^n - n - 1$ indices are found or the storage constraint is violated.

4 Evaluation of Graph Join Indices 

Our experimental study utilized a DW schema and data set from [3]. The number of rows and the width (in bytes) of each table are given in Figure 1. The page size (in bytes) that we use for our evaluation is 8192.

We consider five frequently asked queries [1]. To support the execution of these queries, we use the algorithm of Yang et al. [16] to select two views $V_1$ and $V_2$ to be materialized:

- $V_1 = \sigma_{state = \text{“IL”}}(CUSTOMER) \bowtie SALES$, and

- $V_2 = \sigma_{package = \text{“Box”}}(PRODUCT) \bowtie SALES$.

The matrix in Figure 3 represents the rewriting matrix. It also gives the fre- 
quency (FR) with which each query is executed. 



      V1  V2  T   C   FR
Q1    1   0   0   0   ?
Q2    0   1   1   0   5
Q3    0   1   1   1   5
Q4    1   0   1   0   10
Q5    0   1   0   1   15



Fig. 3. Rewriting Matrix 



Since the queries can directly utilize these materialized views, we first
evaluate whether indexing the views and dimension tables leads to any
reduction in query processing cost. Table 1 enumerates all 11 possible GJIs
(for n = 4 nodes). The GJIs are built over dimension tables TIME (T) and
CUSTOMER (C), and views View1 (V1) and View2 (V2). For example, in Table 1,
the GJI (V1 ~ V2) is between views V1 and V2, the GJI (V1 ~ T ~ C) is between
view V1 and dimension tables TIME and CUSTOMER, and the GJI
(V1 ~ V2 ~ T ~ C) is the complete GJI covering views V1 and V2 and dimension
tables TIME and CUSTOMER. Note that the cardinalities of views V1 and V2 are
200,000 and 400,000 rows, respectively. Table 1 also lists the number of rows
for GJIs.
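
The count of 11 candidates is consistent with taking every subset of at least
two of the n = 4 nodes, which gives 2^n - n - 1 subsets. A minimal Python
sketch of this enumeration follows. The function name candidate_gjis is an
illustrative assumption and is not taken from the paper; the sketch also
assumes that every such subset is an admissible GJI, as Table 1 suggests.

```python
from itertools import combinations


def candidate_gjis(nodes):
    """Return every subset of at least two nodes, smallest subsets first."""
    return [subset
            for size in range(2, len(nodes) + 1)
            for subset in combinations(nodes, size)]


nodes = ["V1", "V2", "T", "C"]
gjis = candidate_gjis(nodes)
for gji in gjis:
    print(" ~ ".join(gji))

# The number of candidates matches the closed form 2^n - n - 1.
assert len(gjis) == 2 ** len(nodes) - len(nodes) - 1
```

The exponential growth of this candidate set is one reason a greedy selection
strategy is attractive once n grows beyond a handful of nodes.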

Index              View1 (V1)  View2 (V2)  TIME (T)  CUSTOMER (C)  # Rows
V1 ~ V2            1           1           0         0             4000000
V1 ~ T             1           0           1         0             2000000
V1 ~ C             1           0           0         1             2000000
V1 ~ V2 ~ T        1           1           1         0             4000000
V1 ~ V2 ~ C        1           1           0         1             4000000
V1 ~ T ~ C         1           0           1         1             2000000
V1 ~ V2 ~ T ~ C    1           1           1         1             4000000
V2 ~ T             0           1           1         0             4000000
V2 ~ C             0           1           0         1             4000000
V2 ~ T ~ C         0           1           1         1             4000000
T ~ C              0           0           1         1             9846000

Table 1. Set of GJIs and their Sizes

Referring to Table 2, one can see that the GJI (V1 ~ T) has the lowest storage
cost, 81504731 bytes, but it gives the least query processing cost only for
query Q4. Note that the GJI (V1 ~ V2 ~ T ~ C) gives the least query processing
cost for queries Q1, Q2, Q3 and Q5, but not for query Q4. This is because Q4
needs to access only V1 and TIME, and the GJI (V1 ~ T) fits this query
exactly. The storage requirement for (V1 ~ V2 ~ T ~ C) is 293778358 bytes
(about 0.3 GB). Note that there is no single GJI that gives the best query
processing cost for all the queries. When indexing materialized views, one can
constrain the storage requirement even further. That is, with a storage
availability of just 0.23 GB for GJIs, the best possible index (V2 ~ T ~ C)
requires just 91821 page accesses, as against 510669 page accesses for the
hash join algorithm (HJA).



Index              Storage (bytes)  Q1     Q2      Q3      Q4     Q5      TQPC
V1 ~ V2            197751734        21729  41522   106343  21729  106328  297652
V1 ~ T             81504731         21729  118683  183504         151248  475800
V1 ~ C             107653595        21729  118656  141254  21729  119560  422928
V1 ~ V2 ~ T        232690614        21729  1158    104431  663    106328  234310
V1 ~ V2 ~ C        258839478        21729  41522   41522   21729  2531    129034
V1 ~ T ~ C         125159899        21729  118683  118683         119560  379292
V1 ~ V2 ~ T ~ C    293778358        21729  1158    1158    663    2531    27240
V2 ~ T             162853814        21729  1158    104431  65244  106328  298891
V2 ~ C             189002678        21729  41522   41522   65190  2531    172495
V2 ~ T ~ C         223941558        21729  1158    1158    65244  2531    91821
T ~ C              97060186         21729  118697  118697  65231  119580  443935
Complete SJI       4967170048       23144  31595   31595   23147  45491   154973
Hash Join          0                21729  118656  153846  65190  151248  510669

Table 2. GJI Storage Cost and Query Processing Costs
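
The storage-constrained choice discussed above can be reproduced directly from
the storage and TQPC columns of Table 2. The following Python sketch is an
exhaustive search over the 11 GJIs, not the greedy GA1 or GAK algorithm. The
names TABLE_2 and best_gji are illustrative assumptions, and the budget
assumes 1 GB = 10^9 bytes.

```python
# (index, storage in bytes, total query processing cost in page accesses),
# copied from the Storage and TQPC columns of Table 2 (GJIs only).
TABLE_2 = [
    ("V1 ~ V2",         197751734, 297652),
    ("V1 ~ T",           81504731, 475800),
    ("V1 ~ C",          107653595, 422928),
    ("V1 ~ V2 ~ T",     232690614, 234310),
    ("V1 ~ V2 ~ C",     258839478, 129034),
    ("V1 ~ T ~ C",      125159899, 379292),
    ("V1 ~ V2 ~ T ~ C", 293778358,  27240),
    ("V2 ~ T",          162853814, 298891),
    ("V2 ~ C",          189002678, 172495),
    ("V2 ~ T ~ C",      223941558,  91821),
    ("T ~ C",            97060186, 443935),
]


def best_gji(candidates, budget_bytes):
    """Return the cheapest candidate whose storage fits the budget, or None."""
    feasible = [c for c in candidates if c[1] <= budget_bytes]
    return min(feasible, key=lambda c: c[2], default=None)


budget = 230_000_000  # 0.23 GB, assuming 1 GB = 10^9 bytes
print(best_gji(TABLE_2, budget))
```

Under this budget the search returns (V2 ~ T ~ C) with 91821 page accesses,
in agreement with the text; with an unlimited budget it returns the complete
GJI.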



Table 3 gives the normalized query processing costs for queries Q1 to Q5
using different GJIs, together with a normalized total cost. The normalization
is done with respect to the cost of HJA for processing the five queries. With
the complete GJI (V1 ~ V2 ~ T ~ C), the cost of processing queries Q1, Q2, Q3,
Q4, and Q5 ranges from just 1% (for query Q3) to 100% (for query Q1) of the
HJA cost for those queries. Note that the least cost for processing query Q1
is 100% of the HJA cost. For queries Q3 and Q5, almost all GJIs (except
V1 ~ T) provide a cost much lower than the HJA cost (ranging from 1% to 119%
for Q3, and 2% to 100% for Q5). Therefore, there is a significant cost
reduction in processing queries with GJIs in comparison to the HJA approach.
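
As a worked example of the normalization, the per-query costs of the complete
GJI can be divided by the corresponding HJA costs from Table 2. The sketch
below is illustrative only: the names HJA, COMPLETE_GJI and normalize are
assumptions, and rounding to two decimals is assumed for the per-query values.

```python
# Page accesses per query, copied from Table 2.
HJA = {"Q1": 21729, "Q2": 118656, "Q3": 153846, "Q4": 65190, "Q5": 151248}
COMPLETE_GJI = {"Q1": 21729, "Q2": 1158, "Q3": 1158, "Q4": 663, "Q5": 2531}


def normalize(costs, baseline):
    """Express each query cost as a fraction of the baseline cost."""
    return {q: round(costs[q] / baseline[q], 2) for q in costs}


print(normalize(COMPLETE_GJI, HJA))
```

The result ranges from 0.01 (Q2 to Q4) to 1.00 (Q1), matching the 1% to 100%
range quoted for the complete GJI.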



Index              Q1    Q2    Q3    Q4    Q5    TQPC
V1 ~ V2            1.00  0.35  0.69  0.33  0.70  0.58
V1 ~ T             1.00  1.00  1.19  0.01  1.00  0.93
V1 ~ C             1.00  1.00  0.92        0.79  0.82
V1 ~ V2 ~ T              0.01
V1 ~ V2 ~ C                    0.27        0.02  0.25
V1 ~ T ~ C                           0.01  0.79
V1 ~ V2 ~ T ~ C                0.01  0.01  0.02  0.05
V2 ~ T             1.00  0.01  0.68  1.00  0.70
V2 ~ C             1.00  0.35  0.27  1.00  0.02  0.33
V2 ~ T ~ C         1.00  0.01  0.01  1.00  0.02  0.17
T ~ C              1.00  1.00  0.77  1.00  0.79  0.86

Table 3. Normalized Query Processing Costs



5 Conclusions 



Index selection is a very important problem in data warehouse design. In this
paper, we have presented the graph join index (GJI) mechanism for efficiently
processing queries in a data warehousing environment. GJIs can be uniformly
applied to both fact/dimension tables and materialized views, and facilitate
the processing of queries both with and without materialized views. We have
developed the storage, query processing, and update cost models for GJIs, and
furthermore evaluated the GJIs in terms of their storage cost and query
processing efficiency. We have also developed, for a given set of queries with
limited storage availability, two greedy GJI selection algorithms, one that
selects a single GJI and one that selects more than one GJI. These algorithms
were tested against an example data warehousing environment and were found to
give the optimal solution (with respect to the one derived from exhaustive
search).

From our results, we find that having GJIs (whether partial or complete)
provides better performance than a two-node join index while incurring a quite
reasonable extra storage overhead.

Future work involves developing efficient query processing strategies using
GJIs, and further evaluation and development of index selection algorithms
using other techniques, such as index merging and simulated annealing.

Abstract. In this paper we propose adaptive data warehouse maintenance,
based on the optimistic partial replication of base source data that has
already been used in deriving view tuples. Our method reduces local
computation at the view as well as communication with the outside sources,
and lowers the execution load on the base sources, which leads to a more
up-to-date state of the data warehouse view.



1 Introduction 

Data Warehouses integrate and summarize information from one or several base 
sources in a separate storage, the materialized view. User queries are directed to 
the materialized view for fast response time, while updates are sent to the base 
sources and eventually propagated to the materialized view to avoid staleness of 
the data. After initialization, the computation at the materialized view executes 
two different tasks. First, the maintenance module refreshes the warehouse view 
by either re-deriving the whole view, or by incrementally computing and inte- 
grating the source updates with the current view [7, 11, 6, 1]. As base sources 
can be distributed, an important issue during view maintenance is to reduce 
the communication between the view and the underlying sources. Note that the 
update effects on the view may be related not only to the base source changes, 
but also to the current state of other sources. Several update queries are then 
necessary to gather all this information. 

Second, based on the data already stored at the materialized view, user
queries are computed and answered. As a result, the choice of a maintenance
approach directly affects the speed and the freshness of answers to user
queries. Several maintenance approaches for materialized views have been
proposed, ranging from full replication of the base source data at the view to
the materialization of only the query result tuples. Within this range,
self-maintenance algorithms [3, 5] use key and referential integrity
constraints to derive a subset of the base source data and maintain a
select-project-join view without accessing the outside sources. These methods
derive maintenance expressions and eliminate the ones that cannot lead to
changes at the view. However, this is a pessimistic approach that requires the
storage of all the data that can possibly be useful for the view’s
maintenance. Other methods log a history of the updates or auxiliary
information on certain intermediate results [10, 2]. These can also be
considered pessimistic methods, since they store all the data that can
possibly be useful. This can lead not only to a significant storage overhead
but also, more importantly, to delayed access to the auxiliary view
information during maintenance, because of the large volume of data.

* This work was supported in part by NSF under grant numbers CCR-9712108,
EIA-9818320, IIS98-17432 and IIS99-70700.

By contrast, the edge fitting solution we develop in this paper uses the
optimistic replication of a subset of information from base sources. That is,
we propose to store from each base source only the information already
accessed during previous view maintenance, so that no computational overhead
is needed to derive it. Our optimistic method, which reduces the amount of
auxiliary information with respect to pessimistic approaches, is driven by the
need to speed up access to this information and therefore the maintenance task
of the view. Since we reduce the probability of accessing base sources during
view maintenance, the sites storing base information can execute more user
updates/queries rather than provide view update computation and data.

In Section 2 we present the data warehouse maintenance problem in a dis- 
tributed system. We further develop the notion of a View Derivation Graph 
(Section 3) and describe a technique called edge fitting for incremental mainte- 
nance of data warehouses. In Section 4, we present the details of the proposed 
algorithm for incremental view maintenance. In the conclusion (Section 5) we 
present our direction for future work. 

2 The Model 

A distributed data warehouse model consists of one or more independent views
defined over multiple base sources. Communication between two sites is assumed
to be both FIFO and reliable. Views derive data from base sources by using SPJ
operations as well as an aggregate function $g$, where
$g \in \{\mathrm{COUNT}, \mathrm{SUM}\}$. Note that this set of aggregate
functions is easily implemented by storing only a limited amount of
information related to the edges of the view derivation graph structure that
we propose in this paper. To support an extended set
$g \in \{\mathrm{MIN}, \mathrm{MAX}, \mathrm{AVG}, \mathrm{COUNT}, \mathrm{SUM}\}$
we need access to base source information, a procedure whose analysis is
beyond the scope of this paper. An attribute is preserved at the warehouse
view V if it is either a regular attribute or an aggregate attribute in the
view’s schema. Projections and selection conditions involving attributes of a
single source table are pushed down to the corresponding source. Also, for the
purpose of this paper, we assume a tree ordering of the join conditions
between different tables, without cycles or self-joins. However, our method
supports full, right or left outerjoin. Our algorithm therefore handles a
large set of view definitions that in practice are based on the star or
snowflake schemas. For simplicity, in some of our examples we describe the
computation over the base sources simply as the join operation.

A data warehouse should be initialized, and then maintained up to date with
respect to the underlying source data. Let a view V be defined over base
sources $B_1, \cdots, B_n$ as $V = B_1 \bowtie B_2 \bowtie \cdots \bowtie B_n$.
An update $\Delta B_i$ is propagated from source $B_i$ to V. Then the
maintenance task of the view is to compute the effects of $\Delta B_i$ by
undergoing an update phase. An update at V is a function of the change
$\Delta B_i$, as well as of the data stored at the base sources. The necessity
for communication with several distributed base sources during view
maintenance may introduce update delays and lead to staleness of data.
Moreover, since base sources both receive and integrate their own updates and
answer update queries of the dependent data warehouse views, the communication
with the base sources can become a bottleneck in the system. It is therefore
essential to reduce not only the number of update queries but also the number
of base sources involved in an update phase. The challenge we are addressing
in this paper is therefore twofold: to reduce communication, i.e., reduce the
possibility of the view maintenance task being delayed due to source
unavailability/failures, as well as to reduce the amount of storage necessary
at the data warehouse. We therefore believe that it is essential to develop
adaptive maintenance methods that optimistically store only the information
that is expected with high probability to be necessary for future updates.
Although such a method does not completely eliminate the necessity for
communication with the outside sources, it can greatly reduce it.

In this paper we present edge fitting, a graph-based method that stores
information related to the path that leads to the materialization of tuples,
i.e., attribute values that lead to the generation of tuples in the warehouse
view. Although we present a graphical representation of dependencies, the
adaptive maintenance method is not dependent on the structure of this
auxiliary information. For example, the view site may store, in relational
tables, subsets of base source data previously used for the derivation of the
view. In this case, the computation overhead of the join operation is still
reduced, due to the lower number of tuples stored. Storing the graph imposes
the data structure overhead, but the necessary tuples are accessed faster.

3 The Edge Fitting Method 

The point of this paper is not how auxiliary data is used, but what
information we can store without imposing additional computation overhead. We
propose the edge fitting method, which reuses only the base data that the
warehouse view accessed and/or computed during initialization and previous
maintenance steps. We further show how this data can easily be used to improve
the efficiency of future maintenance tasks. Assume a source update is the
insertion of a tuple whose attribute values involved in SPJ conditions are
already part of the auxiliary data, i.e., have been used in deriving view
information. Then the update is simply fitted to the previously computed
derivation paths, and the final result corresponding to the data stored by the
materialized view is changed. If only part of the attribute values of the new
update match the auxiliary path information, then communication is needed to
calculate the effect of the remaining attributes. However, if the derivation
paths make up a dense graph, this subset is very small, and the probability of
communicating with a source is much lower than in the case of the basic
incremental view maintenance approach with no auxiliary data.



3.1 Structuring the View Data with Derivation Graphs 

To illustrate the edge fitting method and the following definitions, consider
the following example. A data warehouse view derives information from base
source tables $B_1$, $B_2$ and $B_3$, with attributes (A,B), (B,C,D) and (D,E)
respectively. We show all tuples stored at the base relations in Figure 1.
View V is defined as $\Pi_{A,E}(B_1 \bowtie B_2 \bowtie B_3)$, a set of tuples
(A,E), where A and E are the attributes preserved at the view V. Let a view V
derive data from base sources $B_1, \cdots, B_i, \cdots, B_j, \cdots, B_n$.
Without loss of generality, assume that the index of a base source corresponds
to the order of execution of the view derivation. Then we denote a node
relationship to be

$n_{B_i,B_{i+1}}(v_i) = \{ v_{i+1} \mid (\text{tuple } (v_i, \cdots) \in B_i) \wedge (\text{tuple } (v_{i+1}, \cdots) \in B_{i+1}) \}$,

where $v_i$ is a value of the primary attribute of $B_i$ and $v_{i+1}$ is a
value of the join attribute of $B_i$ and $B_{i+1}$. If the set
$n_{B_i,B_{i+1}}(v_i)$ is null, then there is no join operation that directly
relates tables $B_i$ and $B_{i+1}$. In our example,
$n_{B_1,B_2}(1) = \{2,3\}$, where the primary attribute is A and the joinable
attribute is B; $n_{B_1,B_2}(2) = \{4\}$; $n_{B_1,B_2}(3) = \{4,2\}$;
$n_{B_1,B_2}(4) = \{\}$; $n_{B_2,B_3}(2) = \{1,4\}$, where the primary
attribute is B and the join attribute is D; $n_{B_2,B_3}(4) = \{4,3\}$;
$n_{B_2,B_3}(3) = \{3\}$.

Fig. 1. View Derivation Paths for DP(v1 = 1)
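
The node relationship can be computed with a single pass over a base table.
The following Python sketch is a minimal illustration under stated
assumptions: the base tuples are hypothetical (the actual tuples appear only
in the figure) and were chosen so that the resulting sets match the sets
{2,3}, {4}, {4,2}, {} and {1,4}, {4,3}, {3} quoted in the example. The
function name node_relationship and its column-index parameters are also
assumptions.

```python
def node_relationship(b_i, key_col, join_col, b_next, next_col):
    """Map each value of b_i[key_col] to the set of b_i[join_col] values
    that also appear in column next_col of b_next."""
    joinable = {row[next_col] for row in b_next}
    relationship = {}
    for row in b_i:
        partners = relationship.setdefault(row[key_col], set())
        if row[join_col] in joinable:
            partners.add(row[join_col])
    return relationship


# Hypothetical base tables; only the join structure is meant to be realistic.
B1 = [(1, 2), (1, 3), (2, 4), (3, 4), (3, 2), (4, 7)]          # (A, B)
B2 = [(2, 3, 1), (2, 0, 4), (4, 0, 4), (4, 0, 3), (3, 0, 3)]   # (B, C, D)
B3 = [(1, 10), (3, 11), (4, 12)]                               # (D, E)

n_b1_b2 = node_relationship(B1, 0, 1, B2, 0)
n_b2_b3 = node_relationship(B2, 0, 2, B3, 0)
print(n_b1_b2)
print(n_b2_b3)
```

A value whose set is empty, such as A = 4 here, contributes no derivation
path and therefore no tuple to the view.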



The concept of view derivation paths expresses the joinability across all
sources. A view derivation path is a graphical illustration of joinability
instances only across tuples at the base sources that are useful in the
generation of tuples in the materialized view. Let $B_1, \cdots, B_n$ be the
base sources accessed by V and indexed according to the execution order during
the instantiation of V.

Definition 1 A View Derivation Path $DP(v_1)$ for a value $v_1$ of the primary
attribute of $B_1$ is a graph G(V, E) with V as a set of vertices and E a set
of edges defined as follows:

1. An edge $e \in E$ represents the same-tuple relationship between two
attributes.

2. Nodes express the joinability of tuples, and correspond to values of join
attributes between tuples. A node has only outgoing edges if it corresponds to
a primary attribute, and it has only incoming edges if it represents a view
preserved attribute. In general, a node $v_i$ corresponds to the value of the
join attribute between tuples represented by the incoming and outgoing edges.

Figure 1 represents the different View Derivation Paths corresponding to
$DP(v_1 = 1)$, for primary attribute A, join attributes B, D and view
preserved attributes A, E.

Finally, we introduce the derivation graph, which traces all derivation paths 
from the primary to all the preserved attributes. Assuming that common edges 
are indistinguishable, which is a valid assumption for a large set of applications, 
then multiple common paths are collapsed into single paths. The information 
stored at the view is used for data warehouse queries, and therefore the view 
tuples should be readily available. In order to avoid an exhaustive search on 
the derivation paths, we add to the view derivation graph a set of direct edges 
between the primary and the view preserved attributes (in the example attributes 
A and E respectively). 

Definition 2 View Derivation Graph

$DG = \bigcup_{v_1 \in PrimaryAttribute(B_1)} DP(v_1)$, where values $v_1$ are
attribute values for the primary attribute of $B_1$. An edge $(v_1, v_{n+1})$
exists if there is a derivation path $DP(v_1)$ from $v_1$ to the last
(according to the base source indexing order) view preserved attribute.




Fig. 2. Derivation Graph before the new updates 



The derivation graph (Figure 2) corresponding to our example represents
all the possible view derivation paths from $B_1$ to $B_n$. Nodes are join
attribute values of the different base relations, (B,D), together with the
primary (and view preserved) attribute of $B_1$ and the view preserved
attribute of $B_3$, i.e., (A,E). An edge represents a join pair between two
relations.

Next, we will show the effects of an update on a materialized view. We assume
that an update is either an insert or a delete operation. Modifications to a
tuple in the data sources are propagated to the view as a delete followed by
an insert operation.

3.2 Integrating Updates: Inserts 

Base sources can be updated, and as a result the derivation graphs should
reflect these changes. To continue our example, assume that source $B_2$
incorporates, consecutively, the updates shown in Figure 3. Figure 3 also
represents the derivation graph after the view integrates these new updates.
For clarity, we omit the direct edges $(v_1, v_{n+1})$ between the primary and
measure attributes.



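[Figure: derivation graph over the attributes A, B, D and E of sources
B1 (A,B), B2 (B,C,D) and B3 (D,E), after the updates insert(3,2,1),
insert(3,2,5) and insert(6,3,6).]

Fig. 3. Update Derivation Graph after the new updates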



In general, an update tuple is of the form $(v_1, \cdots, v_i, \cdots)$. For
each join attribute value $v_i$ we can, however, consider one by one the
reduced tuple $(v_1, v_i)$. Then let an update to a base source $B_i$ have a
value $left_{value}$ of the attribute joinable with $B_{i-1}$ and another
attribute value $right_{value}$ joinable with source $B_{i+1}$. For our
example, if $\Delta B_i = \Delta B_2 = (3,2,1)$, then $left_{value} = 3$ and
$right_{value} = 1$. If $\Delta B_i = \Delta B_2 = (3,2,5)$, then
$left_{value} = 3$ and $right_{value} = 5$, while if
$\Delta B_i = \Delta B_2 = (6,3,6)$, then $left_{value} = 6$ and
$right_{value} = 6$. An update falls in one of the following three cases:

Case 1: $(left_{value} \in DG) \wedge (right_{value} \in DG)$. The insertion
of the first update, $\Delta B_2 = (3,2,1)$, generates a single edge in the
graph (Figure 3). This is because the join attribute values of $\Delta B_2$
have already participated in the generation of tuples at the materialized
view. In general, the insertion of a tuple in this case generates either one
new edge in the graph or none. Communication with the external sources is not
necessary.

Case 2: $(left_{value} \in DG) \vee (right_{value} \in DG)$. The second
update, $\Delta B_2 = (3,2,5)$, illustrates the case when only one of the join
attribute values, B, is part of the previously stored derivation graph. Since
the value of the attribute D does not correspond to any value previously used
in the join with tuples in $B_3$, computation is needed for the generation of
the remaining derivation path. Note that, if the number of base sources
involved in the derivation of a view is large, then it may also be the case
that the portion of the path to be added to the derivation graph merges with
the previously stored graph at some attribute node. We distinguish between two
cases. If $left_{value} \in DG$, then base sources
$\{B_1, B_2, \cdots, B_{i-1}\}$ need not be queried. If
$right_{value} \in DG$, then the update query should not include sources
$\{B_{i+1}, B_{i+2}, \cdots, B_n\}$.

Case 3: $(left_{value} \notin DG) \wedge (right_{value} \notin DG)$. Finally,
this is the case where none of the joinable attribute values of the new update
($\Delta B_2 = (6,3,6)$) are part of the derivation graph. Then the effects of
the change at $B_2$ on the materialized view need to be calculated by joining
it with the base relation to the right and then to the left of $B_2$. Note
again that, if the derivation graph is dense, then the path to be computed may
merge at some point with a path already present in the graph, and only a
subset of queries is necessary.
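
The three cases can be told apart with two set-membership tests. The Python
sketch below is a minimal illustration, not the authors' implementation: the
function name classify_insert is an assumption, and the node sets are the B
and D values that occur in the node relationships of the running example.

```python
def classify_insert(left_value, right_value, left_nodes, right_nodes):
    """Return the case number (1, 2 or 3) for an inserted tuple."""
    left_known = left_value in left_nodes
    right_known = right_value in right_nodes
    if left_known and right_known:
        return 1  # both join values already in the graph: no communication
    if left_known or right_known:
        return 2  # one side known: only the other side must be queried
    return 3      # neither side known: query both directions


# Join attribute values of B and D already present in the derivation graph.
b_nodes = {2, 3, 4}
d_nodes = {1, 3, 4}

# Updates to B2 are (B, C, D) tuples; B is the left value, D the right value.
for b, _c, d in [(3, 2, 1), (3, 2, 5), (6, 3, 6)]:
    print((b, _c, d), "-> case", classify_insert(b, d, b_nodes, d_nodes))
```

For the three updates of the example this yields Case 1, Case 2 and Case 3
respectively, which matches the discussion above.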

3.3 Integrating Deletes 

Deleting a tuple $(v_1, \cdots, v_i, \cdots)$ in one of the underlying base
relations is interpreted at the view as the deletion of an edge in the graph
for each $(v_1, v_i)$, $v_i$ being a value of a join attribute. However, if an
edge also corresponds to at least one other tuple in the base relation, it
should not be deleted. Consequently, there is a need to compute and store
vector tuple counts as well as node tuple counts [11] at each node in the
derivation graph. At a node corresponding to some attribute value $v_i$, the
node tuple count reflects the derived tuple count, i.e., the number of tuples
resulting from joining base relations up to $B_i$ whose derivation paths go
through the node $v_i$.

Since the resulting tuple can be reached through different derivation paths,
each node should store a vector count^1 such that each vector entry represents
the counter for a given value of the primary attribute. The number of entries
in a vector is the number of different values of the primary attribute, in our
example attribute A. The following rule is used to initialize and maintain the
node counters of a derivation graph, considering the derivation paths from
left to right: the tuple count of a node is the sum of the vectors on incoming
edges, multiplied by the corresponding edge tuple count.
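
The counting rule can be sketched as a left-to-right pass over the layers of
the derivation graph. The Python code below is one possible reading of the
rule and not the authors' implementation: the function name propagate_counts,
the layered edge dictionaries, and the edge tuple counts (all set to 1) are
assumptions made for illustration.

```python
from collections import defaultdict


def propagate_counts(primary_values, layers):
    """Compute node vector counts layer by layer.

    layers is a list of dicts {(u, v): edge_tuple_count}, ordered left to
    right. The result is a list with one dict per layer of nodes, mapping
    node -> {primary value: derived tuple count}.
    """
    current = {a: {a: 1} for a in primary_values}
    result = [current]
    for edges in layers:
        nxt = defaultdict(lambda: defaultdict(int))
        for (u, v), edge_count in edges.items():
            # Add the vector of the source node, scaled by the edge count.
            for a, count in current.get(u, {}).items():
                nxt[v][a] += count * edge_count
        current = {v: dict(vec) for v, vec in nxt.items()}
        result.append(current)
    return result


# Hypothetical edge tuple counts for the A -> B and B -> D layers.
a_to_b = {(1, 2): 1, (1, 3): 1, (2, 4): 1, (3, 4): 1, (3, 2): 1}
b_to_d = {(2, 1): 1, (2, 4): 1, (4, 4): 1, (4, 3): 1, (3, 3): 1}

counts = propagate_counts([1, 2, 3], [a_to_b, b_to_d])
print(counts[2])  # vector counts at the D nodes
```

In this example the D node with value 4 is reached twice from A = 3 (through
B = 2 and B = 4), so its vector entry for A = 3 is 2.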

The propagation of a delete operation at the view leads to the deletion of
edges that are not used by other derivation paths. In the best case, no edge
is actually deleted, but the derivation path is followed and the counters
associated with the path are modified. Counters are also modified in the case
of a delete operation that leads to a partial, but not full, deletion of the
derivation path. For example, let the vector count associated with each node
have the following entries: $(v_1 = 1, v_1 = 2, v_1 = 3)$. The resulting
counters are shown in Figure 4. During the view maintenance, insert and delete
updates in the derivation graph will obviously affect some of the tuple
counters. Following the completion of the update queries triggered by the
changes at source $B_2$, the derivation graph and the associated node vector
counters will be as illustrated in Figure 5 (for clarity, we again omit the
direct edges between attributes A and E).

^1 Note that we introduce the notion of vector counts solely for exposition
purposes. In the full description of the algorithm, we will eliminate the need
for vector counts.

Fig. 4. Vector counter associated with every node in the Update Derivation
Graph before updates



[Figure: derivation graph over the attributes A, B, D and E, with the vector
counters of each node.]

Fig. 5. Vector counters associated with every node in the Update Derivation Graph



4 Discussion on the View Maintenance Algorithm 

In this section we give an overview of the edge fitting maintenance method.
Due to lack of space, some details are omitted; they can be found in [8].

If the edge fitting maintenance method is used, then most of the information
in the view derivation graph can be hidden from users, while queries are
answered based solely on the information stored by the nodes corresponding to
the primary and measure attributes, and the direct edges between them. For
both maintenance and query computations, we enforce an index structure for
each joinable attribute, as well as for the primary attribute of the first
joinable relation and the measure attribute. However, for maintenance, the
storage overhead for additional information such as the vector counts can
become inefficiently large. As the number of attribute values changes, the
size of a node vector counter should also be changed dynamically. Although
inefficient from a storage point of view, the vector counts associated with
the nodes in the derivation graph provide information on tuple counts at all
the points on a derivation path.

Consider eliminating the vector counts associated with every node in the 
derivation graph. Since the vector counts of the measure attribute values are 
needed for view queries, the corresponding information can be stored on the 
direct edges between the primary and the measure attributes. 

If the node vector counts are not maintained explicitly, their computation
is necessary during view update. For comparison, consider again the derivation
graph with vector counts. The insertion or deletion of an edge
$(left_{value}, right_{value})$ affects the number of tuples in join
operations that involve the node corresponding to $right_{value}$. That is,
the vector counts on the derivation paths going through the node
$right_{value}$, and related to nodes following $right_{value}$ and up to
$v_{n+1}$, are modified. In the case where the insert/delete operation at the
view requires the insertion/deletion of more than one edge, the same
observation holds for each modified edge. In contrast, a view derivation graph
without node vector counts does not offer the information on how many tuples
are actually affected by the insert/delete of an edge, and this information
needs explicit computation. This can be done by first traversing to the left
the derivation paths that contain the node $left_{value}$, computing the total
number of tuples affected by the insertion/deletion of the edge, and then
traversing all the paths containing the node $right_{value}$. More details on
how an insert/delete operation affects a view derivation graph without node
vector counts are provided in [8].
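
One way to read this explicit computation is as a path count: the derivation
paths that use an edge are those reaching its left node from a primary node,
combined with those leaving its right node towards a preserved attribute. The
Python sketch below follows that reading. It is an assumption-laden
illustration rather than the algorithm of [8]; the function names, the layered
edge sets and the D to E edges are hypothetical.

```python
def count_paths_left(node, layer, layers):
    """Number of paths from any primary node (layer 0) to node."""
    if layer == 0:
        return 1
    return sum(count_paths_left(u, layer - 1, layers)
               for (u, v) in layers[layer - 1] if v == node)


def count_paths_right(node, layer, layers):
    """Number of paths from node to any node of the last layer."""
    if layer == len(layers):
        return 1
    return sum(count_paths_right(v, layer + 1, layers)
               for (u, v) in layers[layer] if u == node)


def affected_derivations(left_value, right_value, layer, layers):
    """Derivation paths that use edge (left_value, right_value) of
    layers[layer], without consulting any stored vector counts."""
    return (count_paths_left(left_value, layer, layers)
            * count_paths_right(right_value, layer + 1, layers))


# layers[0]: A -> B edges, layers[1]: B -> D edges, layers[2]: D -> E edges.
layers = [
    {(1, 2), (1, 3), (2, 4), (3, 4), (3, 2)},
    {(2, 1), (2, 4), (4, 4), (4, 3), (3, 3)},
    {(1, 10), (3, 11), (4, 12)},  # hypothetical E values
]

print(affected_derivations(3, 1, 1, layers))  # edge for insert(3,2,1)
print(affected_derivations(4, 4, 1, layers))  # existing edge B = 4, D = 4
```

The traversal cost grows with the density of the graph, which is the price
paid for not storing the vector counts at every node.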

5 Conclusion and Future Work 

In this paper, we have presented a new approach for incremental maintenance of
warehouse views. In order to support adaptive maintenance, we have developed
the notion of a view derivation graph that maintains intermediate information
for computing warehouse views, and have developed the edge fitting algorithm
that allows, in the best case, the views to be refreshed locally as a result
of updates to the underlying data sources. In particular, if the data has a
skewed distribution, which has been increasingly demonstrated for Web data,
the view derivation graph with edge fitting will have significant benefits. A
proposal for an analytical tool to evaluate the suitability of the edge
fitting approach for given application data can be found in [8].

The trends in technology today lead to a partially disconnected world, where
data can be generated on portable devices. Traditional approaches for
incremental view maintenance rely on executing view computation queries
on the fly by contacting the base sources where the data resides. With
increasing reliance on warehousing and OLAP, it has become important to
provide data warehousing support for non-traditional applications such as
mobile and disconnected environments, and unreliable systems. We therefore
believe that solutions to the data warehouse maintenance problem need to be
developed in the context of a disconnectable environment.

The model presented is developed specifically for one or many independent
data warehouse views that derive data directly from base sources. The
distributed multi-view data warehouse is emerging as a design solution for a
growing set of applications, such as storage of messages from newsgroups [4]
or cooperative web caches for the Internet. However, a more general data
warehouse system allows views to derive data from other views [9, 6] as well
as base sources. We propose to increase the capability of the model to also
handle the hierarchical warehouse system.

Abstract. We describe the tuning of a gigabyte-scale TPC-D database
for investigation of incremental maintenance of a materialized view. We
find that incremental maintenance is feasible over a wide range of update
sizes (granularities), that intuitive SQL formulations perform poorly, but
that there are better alternatives. We show that these results can be
meaningfully evaluated and presented only in the context of reasonable
instance tuning.

1 Introduction

Algorithms and frameworks for handling incremental maintenance of
materialized views in a data warehouse have been presented largely in the
abstract, or tested on small databases [LYC+99]. This work presents the
results of exploring (part of) the solution space for implementing
incremental maintenance of a view in a data warehouse with a database
population of an interesting size (nominally 1 gigabyte).

The tuning research results we are aware of focus on individual elements 
or operations to be optimized, often using techniques or features that are un- 
available in commercial DBMS engines. Fortunately, the importance of tuning a 
database has long been known to the people who operate commercial databases, 
and there is an extensive literature available from database vendors (e.g. [Ora], 
[ALS97]) and third parties (e.g. [GC96, Ada99]). One such source, written in
collaboration with a number of vendors, is [Sha92].

While they can be helpful, these volumes are on the one hand overwhelming in 
their massive content, and on the other hand are most useful in an operational 
context rather than a research context. Blindly following the procedures and
advice in these sources could lead to tuning the system specifically to the research 
tasks, but the goal is to explore performance in a realistic setting. Our approach 
has been to try for a “middle way” of moderate aggressiveness in tuning, but 
avoiding distortions to the computing environment. 

2 The Data and the Experimental Setting 

In this experimental setting one of the key components is the underlying data set 
that will be used to evaluate view maintenance strategies. We chose the schema 
of the TPC-D family of benchmarks which has the advantages that it is well- 
known in the DBMS research arena and that it comes with a generator program 
dbgen, and a prescribed set of queries. The generator can be used to populate 
the schema and also to create refresh sets which are updates comprising both 
insertions and deletions. We modified the updates to generate changes to our 
view, as described in the full version of this paper [OAE00].

All tests were performed using Oracle 8i Enterprise Edition, version 8.1.5.0.1,
on RedHat Linux Version 6.1, on a single system with a Pentium-III 550MHz
CPU. Tests were run by execution of SQL*Plus scripts, sometimes for days at a 
time. 

All tests reported here were obtained with the same layout of partitions and 
tablespaces; moreover, identical or functionally similar tables were allocated to 
the same tablespaces in all tests reported here; different allocations were used 
initially and were helpful in validating the correctness of the scripts. 

We chose a view which was a nearly complete denormalization of the schema, 
in effect precomputing the joins required by several of the prescribed queries 
(numbers 7, 8 and 9). In building this view we encountered one consequence of 
the large size of the database: if even a single stage of the query plan has ex- 
ponential complexity, the computation is infeasible. By various methods, mostly 
by providing statistics and hints to the optimizer, we got the time down from 
failure after 48 hours to success after 15 hours, using a query plan based mostly 
on hash joins. 

We verified that queries 7, 8 and 9, when recoded to use the materialized 
view, increased in speed by factors of 603, 18 and 14. 

2.1 Update Granularity 

We measured results for five updates of 2400 rows each. These were computed 
in increments (grains) of multiples of 25 rows: 25, 50, 75, 100, 150, 200, 300, 400,
600, 800, 1200 and 2400 orders each. The tests were organized in such a way that 
the same five results were being computed at each granularity, and only the grain 
size and the number of grains were changing. This range was an interesting one 
since it seemed likely at one end that the view tuples being changed would fit in 
main memory at fine granularity and would not fit at the coarsest granularity. 
The range from 0.002% to 0.2% of the rows being updated by each grain seemed
reasonable for warehouse updates.
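
The grain arithmetic above can be checked with a short sketch. It lists how many grains each grain size yields for one 2400-order update and what fraction of the data one grain touches. The figure of 1,500,000 orders is an assumption taken from the TPC-D schema at scale factor 1, not a number stated in this paper, and the function names are purely illustrative.

```python
# Grain sizes from Section 2.1; each update covers 2400 orders in total.
UPDATE_SIZE = 2400
GRAIN_SIZES = [25, 50, 75, 100, 150, 200, 300, 400, 600, 800, 1200, 2400]

# Assumption (not from the paper): TPC-D at scale factor 1 holds 1,500,000 orders.
ASSUMED_ORDERS = 1_500_000


def grains_per_update(grain_size, update_size=UPDATE_SIZE):
    """Number of grains needed to apply one update at the given grain size."""
    if update_size % grain_size != 0:
        raise ValueError("grain size must divide the update size evenly")
    return update_size // grain_size


def grain_fraction_percent(grain_size, total=ASSUMED_ORDERS):
    """Share of the assumed order population touched by a single grain, in percent."""
    return 100.0 * grain_size / total


if __name__ == "__main__":
    for g in GRAIN_SIZES:
        print(g, grains_per_update(g), round(grain_fraction_percent(g), 4))
```

Under this assumption the smallest and largest grains round to the 0.002% and 0.2% quoted above.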
3 Incremental Maintenance of the View

The calculation of the delta for each update used the algorithm from the Posse 
[OAE99] framework. The calculation proceeds as a series of semijoins, and is 
followed by installation of the computed change into the view. Installation was 
explored with SQL and with a cursor-based algorithm. Times are pictured in 
Figure 1. 




Fig. 1. Execution times with 95% confidence intervals. 



3.1 Using SQL 

The installation of the computed delta was very slow on our first attempts with 
SQL. The installation was performed in three consecutive SQL commands: an 
UPDATE of rows already in the view, an INSERT of new rows, and a DELETE of 
rows no longer needed. The INSERT had acceptable performance from the start, 
but the UPDATE and DELETE were an order of magnitude slower. 

Tuning an Update. The original form of the UPDATE was the one corresponding
to the English description “update rows in the view which correspond
to rows in the delta by adding the values in the delta”. This works out to a
SQL coding style called “coordinated subqueries” (more commonly known as
correlated subqueries). Following is an outline of the query:
update WARE set WARE.value = WARE.value +
    (select DELTA.value from DELTA where DELTA.<key> = WARE.<key>)
where exists (select 'x' from DELTA where DELTA.<key> = WARE.<key>);



In spite of many trials, we could not make the optimizer generate a query 
plan that drives the update by the key on the delta. Instead, the update used a 
plan that tested each row of the view for a matching row in the delta, and altered 
the ones with a match. All of the nearly 6 million rows of the view are read to 
perform this operation, and the elapsed time is large in proportion to the size 
of the view table. Eventually, we found that this behavior is documented
for Oracle, and we understand that it is the case in at least some other DBMS
engines as well.
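
To make the correlated-subquery form concrete, here is a minimal sketch on a toy schema, run on SQLite through Python rather than on Oracle. The table and column names (ware, delta, k, val) are hypothetical stand-ins for the view, the delta and the key. The sketch shows only the semantics of the statement, in particular why the EXISTS guard is needed; it says nothing about the query plans discussed above.

```python
import sqlite3


def make_update_demo_db():
    """Toy view (ware) and delta tables; names are illustrative, not from the paper."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ware (k INTEGER PRIMARY KEY, val INTEGER)")
    conn.execute("CREATE TABLE delta (k INTEGER PRIMARY KEY, val INTEGER)")
    conn.executemany("INSERT INTO ware VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])
    conn.executemany("INSERT INTO delta VALUES (?, ?)", [(2, 5)])
    return conn


def guarded_update(conn):
    """Correlated-subquery update with the EXISTS guard, as in the outline above."""
    conn.execute(
        """
        UPDATE ware
           SET val = val + (SELECT d.val FROM delta d WHERE d.k = ware.k)
         WHERE EXISTS (SELECT 1 FROM delta d WHERE d.k = ware.k)
        """
    )


def unguarded_update(conn):
    """Same update without the guard: unmatched rows receive val + NULL = NULL."""
    conn.execute(
        "UPDATE ware SET val = val + (SELECT d.val FROM delta d WHERE d.k = ware.k)"
    )


def view_rows(conn):
    return conn.execute("SELECT k, val FROM ware ORDER BY k").fetchall()
```

Without the guard, every view row that has no partner in the delta is overwritten with NULL, which is why the outline carries the second subquery even though it repeats the join condition.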

Eventually, a suggestion from USENET [Lew99] led us to a much better idea:
performing an update on a join, in which the optimizer is free to choose the 
better table to drive the join. This improved the speed of the update by an 
order of magnitude because the cardinality of the view is at least 1,000 times 
that of the delta, and for fine-granularity updates, this difference is even more. 
As modified, the update is as follows: 

update (select DELTA.*, WARE.* from DELTA, WARE
        where DELTA.<key> = WARE.<key>)
set <changes>;



The performance of this form is essentially that of a physical index read and 
a row update for each affected row of the view, because the uniform distribution 
of the data makes it unlikely that any two rows will share a database block. 
The point here is that had we not discovered this way to tune the SQL, subse- 
quent comparisons would have been invalid because the SQL would have been 
unnecessarily impaired. 



Tuning a Delete. For a view that contains statistical data, it would be feasible 
to omit the deletion step entirely, and allow tuples to remain that contain only 
zeroes outside the identifying attributes. We chose to include and measure the 
deletion step, but tuned the SQL. 

Since there is no way in SQL to specify conditional deletion in connection 
with an update, deletion has to be a separate step. In order to unambiguously 
identify rows to be deleted, the view contains an attribute count that encodes 
the number of rows of the LINEITEM table that contribute to the entry in the 
view. A row can be deleted when this value goes to zero. 

Tuning of deletion required considering three ways of formulating the SQL. 
The obvious way is to test each tuple for a count of zero, but this requires a 
complete scan of the view. A faster way is to use an index on the count attribute, 
at the cost of maintaining the index, but with the advantage that only the tuples 
to be deleted (i.e. those with count equal to zero) will be accessed. A third way 
involving triggers will be described below. Since only tuples that have just been
updated by the delta will have changed, just these tuples can be examined. 
Formulating efficient SQL for this is a challenge similar to the one for tuning the 
SQL of the update, and results in a formulation like the following^, due again 
to help from USENET [Lew99]: 

delete from WARE ww where (ww.<WARE key attributes>) 
in (select dd.<DELTA key attributes> from DELTA dd); 
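
As a small illustration of restricting the deletion to rows touched by the delta, the sketch below runs a statement of the same shape on SQLite. The names ware, delta, k and cnt are hypothetical. The explicit test cnt = 0 is our assumption about how the count attribute described above enters the statement, since the outline does not show it.

```python
import sqlite3


def make_delete_demo_db():
    """Toy view with a contribution count per row; names are illustrative only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ware (k INTEGER PRIMARY KEY, cnt INTEGER)")
    conn.execute("CREATE TABLE delta (k INTEGER PRIMARY KEY)")
    # Row 2 has already been updated by the delta and its count has reached zero.
    conn.executemany("INSERT INTO ware VALUES (?, ?)", [(1, 2), (2, 0), (3, 1)])
    conn.executemany("INSERT INTO delta VALUES (?)", [(2,), (3,)])
    return conn


def delete_emptied_rows(conn):
    """Delete only rows that the delta touched and whose count dropped to zero.

    The cnt = 0 condition is an assumption; the paper's outline omits it.
    """
    cur = conn.execute(
        "DELETE FROM ware WHERE cnt = 0 AND k IN (SELECT d.k FROM delta d)"
    )
    return cur.rowcount
```

Only the keys present in the delta are candidates, so the statement never needs to consider view rows that this update did not change.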



Triggers. These SQL versions used an index on the count attribute to support
deletion. To see whether removing this requirement would close the performance
gap with the cursor-based version, we re-ran the tests on a version
that used a trigger to maintain a table of rows to be deleted. This improved
performance considerably.
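
The trigger idea can be sketched as follows, again on SQLite rather than Oracle and with hypothetical names (ware, to_delete, ware_zero_count). This is an illustration of the technique, not the trigger used in our tests: whenever an update drives a row's count to zero, the trigger records the key, and the later deletion reads that small table instead of an index on the count attribute.

```python
import sqlite3


def make_trigger_demo_db():
    """Toy view plus a side table filled by a trigger; all names are illustrative."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ware (k INTEGER PRIMARY KEY, val INTEGER, cnt INTEGER);
        CREATE TABLE to_delete (k INTEGER PRIMARY KEY);

        -- Record the key of every row whose count is driven to zero by an update.
        CREATE TRIGGER ware_zero_count AFTER UPDATE OF cnt ON ware
        WHEN NEW.cnt = 0
        BEGIN
            INSERT OR IGNORE INTO to_delete (k) VALUES (NEW.k);
        END;
        """
    )
    conn.executemany(
        "INSERT INTO ware VALUES (?, ?, ?)", [(1, 10, 2), (2, 20, 1), (3, 30, 3)]
    )
    return conn


def purge_recorded_rows(conn):
    """Delete the rows recorded by the trigger, then clear the side table."""
    conn.execute("DELETE FROM ware WHERE k IN (SELECT k FROM to_delete)")
    conn.execute("DELETE FROM to_delete")
```

A usage pattern would be to apply the update step first (for example decrementing cnt for the affected keys) and then call purge_recorded_rows once per grain.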



3.2 Using a Cursor 

No matter how much we tuned the individual SQL commands, they remained 
three distinct commands, and the optimizer does not apply any global optimiza- 
tion to take advantage of the fact that all three are working with the same data. 
In order to do that, we were obliged to code a query plan that SQL cannot 
accomplish, and to do that we used a cursor-based approach. The idea is simple: 
the process will proceed through the delta one row at a time and make whatever 
changes are necessary to the view for that row. The resulting cursor-based code 
was faster, by about a factor of two, in all tests. The code for cursors, omitted 
for reasons of space, is included in the full version of this paper [OAE00].
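
For readers without access to the full version, the following is a minimal sketch of the idea only. It is not the code used in our tests: it is written in Python over SQLite instead of Oracle, and the schema (ware and delta with columns k, val and cnt) is hypothetical. Each delta row is visited once, and the matching view row is inserted, updated or deleted in the same pass.

```python
import sqlite3


def make_cursor_demo_db():
    """Toy view and delta; the delta carries signed changes to val and cnt."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ware (k INTEGER PRIMARY KEY, val INTEGER, cnt INTEGER)")
    conn.execute("CREATE TABLE delta (k INTEGER PRIMARY KEY, val INTEGER, cnt INTEGER)")
    conn.executemany("INSERT INTO ware VALUES (?, ?, ?)", [(1, 10, 2), (2, 20, 1)])
    conn.executemany(
        "INSERT INTO delta VALUES (?, ?, ?)", [(1, 5, 1), (2, -20, -1), (3, 7, 1)]
    )
    return conn


def apply_delta(conn):
    """Walk the delta one row at a time and install each change into the view."""
    delta_rows = conn.execute("SELECT k, val, cnt FROM delta ORDER BY k").fetchall()
    for k, dval, dcnt in delta_rows:
        row = conn.execute("SELECT val, cnt FROM ware WHERE k = ?", (k,)).fetchone()
        if row is None:
            # No matching view row: the delta row is a pure insertion.
            conn.execute(
                "INSERT INTO ware (k, val, cnt) VALUES (?, ?, ?)", (k, dval, dcnt)
            )
            continue
        new_val, new_cnt = row[0] + dval, row[1] + dcnt
        if new_cnt == 0:
            # Nothing contributes to this row any more: delete instead of update.
            conn.execute("DELETE FROM ware WHERE k = ?", (k,))
        else:
            conn.execute(
                "UPDATE ware SET val = ?, cnt = ? WHERE k = ?", (new_val, new_cnt, k)
            )
```

The point of the single pass is that the decision between update, insert and delete is made per row while the row is at hand, which three separate SQL statements cannot express.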

4 Conclusion 

We have presented our experience with performing timing testing of view main- 
tenance for a warehouse on a commercial DBMS, and some of the pitfalls for 
the unwary. 

— Tuning is essential to arriving at correct conclusions about algorithm perfor- 
mance; as a corollary, reporting on tuning-related aspects of the computing 
environment is essential for others’ evaluation of reported results. 

— Comparing our tuned algorithms, at some granularities we find that cursor-
based updating may be as much as four times faster than SQL with an index-
driven DELETE, and somewhat faster than a trigger-based SQL solution.
The untuned versions were sometimes so bad we were unwilling or unable
to complete timing tests. We conclude that comparing untuned algorithms
is meaningless, and running them is not sensible.

^ Actually, hints to the optimizer are also required, but are omitted because they are
not portable. Of course, on different systems it could be that any of these formula-
tions would work differently or require additional or different hints to the optimizer.
Portability of SQL really is an illusion.

— A version using cursors was the clear winner among algorithms for incre- 
mental maintenance. This algorithm makes use of global properties of the 
computation and cannot be produced by optimizers which operate one SQL 
statement at a time. 

— We saw improvements from a number of tuning techniques. There was no 
clear winner. One surprise was that the most efficient SQL for a given task 
may be quite different from the first formulation that comes to mind, al- 
though they may be theoretically equivalent. 

— Incremental view maintenance appears feasible at all the grain sizes we 
tested, although naturally larger grains lead to longer periods with locked 
data. The overall time was surprisingly uniform; it appears that economies 
of scale were offset by cache effects to some extent. We conclude that where 
other considerations allow, the short update window that goes with small 
update granularities may be attractive. 

— Incremental maintenance is at least an order of magnitude faster than com- 
plete recomputation of the view. At one gigabyte of data, complete recom- 
putation is not feasible on more than a very occasional basis. 
Abstract. The generation of representative sample data for data warehouses for
benchmarking, testing and presentation purposes is a challenging, complex task.
Difficulties often arise in producing familiar, complete and consistent sample
data of any scale. Producing sample data manually often causes problems,
which can be avoided by an automatic generation tool producing consistent and
statistically plausible data. The data generated by the tool is based on the linear
statistical model of a hierarchical n-way classification with fixed effects. We
present an approach for a flexible generation of “real world” sample data using
the BEDAWA (BEnchmarks for DAta WArehouses) tool.



1. Introduction 

In the field of data warehouses [2], [4], [11] and OLAP [5], [7], [8] technologies, 
sample data is needed for different purposes, such as benchmarking, testing, and 
demonstrating. Unfortunately, there is a lack of tools for generating sample
data that is familiar, consistent, scalable, and reflects multiple degrees of
freedom. Addressing the sample data generation problem, [14] analysed the samples for
data warehouses and OLAP products of well-known companies, and concluded that
the generation of test data for their products is not yet sufficiently solved.

The sample data generation process is usually a complex, iterative process that 
begins with a modelling step, and ends with the generation of the desired sample data. 

In this paper, we propose BEDAWA as a tool to generate generic sample data for
data warehouses. To obtain representative sample data that is familiar, consistent,
scalable, large in volume, and reflects many degrees of freedom, the BEDAWA tool is
based on a statistical model that will be introduced in section 3. The statistical
correctness of the model provides a framework to define relationships between
dimensions and facts in the context of the star schema, the most popular schema for
designing and building a data warehouse [10], [12]. Furthermore, BEDAWA provides
abilities to define and generate sample data of any size in order to reflect a real world
situation. In addition, BEDAWA is also able to integrate existing external data
sources for the sample data generation.

The remainder of this paper is organized as follows. In section 2, we discuss
related works. Section 3 introduces the statistical model of our prototype. In section 4,
we present the sample data generation process. The BEDAWA tool, which
implements this process, is described in section 5. A typical
example of generating sample data for data warehousing is shown in section 6. The
paper concludes with section 7, which presents our current and future works.



2. Related Works 

Benchmarks are used for measuring and comparing the operations of systems or 
applications working on these systems. This especially applies to databases where 
many efforts have been made to develop tools for benchmarking purposes. 

The Wisconsin benchmark [6] has proved to be a suitable benchmark for
measuring queries on relational database systems. AS³AP is another benchmark
that tests multi-user models with different types of database workloads [17] for
relational databases. The TPC-D benchmark [16], a decision support benchmark, was
introduced in 1990, with its first version completed in 1995. In 1998, the TPC-R and
TPC-H benchmarks [18] were introduced to allow more precise and domain specific
performance measurement.

There are many other benchmarks that have been developed for testing and
comparing different systems in various application domains such as “The Set Query 
Benchmark”, “The Engineering Database Benchmark”, and “Engineering 
Workstation-Server” [3]. 

For data warehouses and OLAP, few approaches are known in the literature. In
[13], an OLAP benchmark, called APB-1, was issued to measure the overall 
performance of an OLAP server. Typical OLAP operations are included in this 
benchmark such as bulk loading of data from internal or external data sources, 
calculation of new data, time series analysis, and so on. 



3. Statistical Model 



Need of Statistical Model 

The definition of a benchmark can be a very time consuming task. For instance, for
defining the TPC-A benchmark, TPC required nearly 1200 man-days of effort [15]. This
is particularly true for data warehouse and OLAP systems, where the required data
volume can become huge. For business purposes we need a safe analysis of the
information dependencies of the enterprises to create data for an OLAP system which
is consistent with respect to the different granularity levels and the interdependencies
between different dimensions. To generate sample data for these systems, a rigorous
calculation model which is based on a sound statistical approach is needed. The
model not only allows users to present the relationships between the data
elements, but also takes into consideration the limitation given by the degrees of
freedom of the generated sample data. When such large and complex data is
generated without the help of an appropriate model, unpredictable consistency
errors can appear and make the tool useless, i.e., the generated sample data is
neither consistent nor realistic.

The Statistical Model 



For simulating realistic OLAP situations, a suitable calculation model can be selected
depending on the structure of the future sample data. Various calculation models can be
applied to different structures of the sample data with different meanings. We apply a
linear statistical model to calculate facts in the context of the star schema [10]. This
linear statistical model is applied to present the relations between the fact and the
dimensions. Each element of a dimension has an effect on the fact, called the “fact
effect value”. A fact value of a fact is calculated based on the fact effect values of
elements of the dimensions. Considering the relationship between the fact and the
dimensions that affect it, the actor who interacts with the BEDAWA tool
defines this relationship by using the statistical linear model of n-way
classification with fixed effects. For simplicity and readability we omit the
interaction. In the following, we use the n-way classification model
without interaction.

The linear model, which can often be found in the real world, is extremely suitable
for presenting the relations between the dimensions and the fact in a star schema.

The linear model used in the BEDAWA tool is described as follows [1]: 



$$y_{ijk\ldots} = \mu + \alpha_i + \beta_j + \gamma_k + \ldots + \varepsilon_{ijk\ldots} \tag{1}$$

where:

$\alpha, \beta, \gamma, \ldots$ : fact effect value sets of the dimensions $D_\alpha, D_\beta, D_\gamma, \ldots$ that have an effect on
the fact $y$; the numbers of elements in the domains of these dimensions
are $n, m, p, \ldots$, respectively,

$i, j, k, \ldots$ : index numbers that are in the ranges $[1..n]$, $[1..m]$, $[1..p]$, ..., respectively,

$\mu$ : average mean of the fact,

$\alpha_i, \beta_j, \gamma_k, \ldots$ : fact effect values of the $i$-th, $j$-th, $k$-th elements of the dimension fact effect
value sets $\alpha, \beta, \gamma, \ldots$, respectively,

$\varepsilon_{ijk\ldots}$ : random error value, using the normal distribution function $N(0, \sigma^2)$,

$y_{ijk\ldots}$ : a value of the fact.

And the conditions:

$$\sum_{i=1}^{n} \alpha_i = 0, \quad \sum_{j=1}^{m} \beta_j = 0, \quad \sum_{k=1}^{p} \gamma_k = 0, \quad \ldots \tag{2}$$

In this model, we assume that there is no interaction between the fact effect values
of any pair of dimensions. That means the fact effect values of a dimension do not
depend on the fact effect values of any other dimension that influences the fact.
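
As an illustration of formula (1) and condition (2), the following minimal sketch computes one fact value from a mean, one effect set per dimension and a normally distributed error. It is not the BEDAWA implementation; the function names and the numbers in the usage note are hypothetical.

```python
import random


def satisfies_zero_sum(effect_set, tol=1e-9):
    """Condition (2): the fact effect values of one dimension sum to zero."""
    return abs(sum(effect_set)) <= tol


def fact_value(mu, effect_sets, indices, sigma, rng):
    """Formula (1) without interaction: mean + one effect per dimension + error.

    effect_sets is a list of lists (one per dimension); indices selects one
    element of each dimension; the error term is drawn from N(0, sigma^2).
    """
    total = mu
    for effect_set, idx in zip(effect_sets, indices):
        total += effect_set[idx]
    return total + rng.gauss(0.0, sigma)
```

For example, with effect sets [-2.0, 0.5, 1.5], [-1.0, 1.0] and [3.0, -3.0], a mean of 100 and sigma set to 0, selecting the indices (2, 0, 1) gives 100 + 1.5 - 1.0 - 3.0 = 97.5.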

In the context of the hierarchy structure of a dimension, the dimension data can be
presented as a tree with the different levels of its hierarchy, see figure 1. A branch of this
tree represents an element in the dimension domain. Consequently, the fact effect value
of the dimension applied to a fact can be calculated from the effects of all
nodes of this branch. For each node of the dimension tree, the node’s fact effect value
can be defined for the fact. The fact effect value in this context is the effect of the
branch from the root to the leaf of the dimension tree. We calculate the fact effect
value of a branch by summing the fact effect values of all nodes of that branch. For
instance, the fact effect value $\alpha_i$ of the dimension $D_\alpha$ is calculated by the formula:

$$\alpha_i = \alpha_j + \alpha_{jk} + \alpha_{jkl} + \ldots \tag{3}$$

where $\alpha_j$, $\alpha_{jk}$, $\alpha_{jkl}$, ... are the fact effect values of the nodes of the branch $\alpha_i$.

Fig. 1. The fact effect value tree of the dimension $D_\alpha$

By the condition (2) we can apply the condition that at any node of the dimension
tree, the sum of the node fact effect values of its sub-nodes is equal to zero:

$$\sum_{j} \alpha_j = 0, \quad \sum_{k} \alpha_{jk} = 0, \quad \sum_{l} \alpha_{jkl} = 0, \quad \ldots \tag{4}$$
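
The branch sum of formula (3) and the sibling condition (4) can be sketched on a small tree. The nested-dictionary representation, the node names and the effect values below are all hypothetical and only serve to illustrate the two formulas.

```python
# A two-level product hierarchy (hypothetical): category, then product.
PRODUCT_TREE = {
    "Food": {
        "effect": 4.0,
        "children": {"Bread": {"effect": 1.0}, "Milk": {"effect": -1.0}},
    },
    "NonFood": {
        "effect": -4.0,
        "children": {"Soap": {"effect": 2.0}, "Paper": {"effect": -2.0}},
    },
}


def branch_effect(children, path):
    """Formula (3): sum the node effects along one branch, from top level to leaf."""
    total = 0.0
    level = children
    for name in path:
        node = level[name]
        total += node["effect"]
        level = node.get("children", {})
    return total


def siblings_sum_to_zero(children, tol=1e-9):
    """Condition (4): at every node, the effects of its sub-nodes sum to zero."""
    if not children:
        return True
    if abs(sum(node["effect"] for node in children.values())) > tol:
        return False
    return all(
        siblings_sum_to_zero(node.get("children", {}), tol)
        for node in children.values()
    )
```

Here the branch Food/Bread has the fact effect value 4.0 + 1.0 = 5.0, and the branch NonFood/Paper has -4.0 - 2.0 = -6.0.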

Example 

To demonstrate our approach, we take the star schema of the well-known grocery
example of Kimball [11], see figure 2. Figure 3 shows the relation between the
fact Dollar_Revenue ($y$) and 3 dimensions: ProductDimension ($\alpha$), TimeDimension
($\beta$), and StoreDimension ($\gamma$), and an average mean ($\mu$).

From the formula (1) and the definition of the fact effect value tree, it follows
that one dimension can affect different facts and can therefore have a different
fact effect value tree for each of them. For instance, the fact Customer_Count is affected by
the dimensions TimeDimension and ProductDimension. These dimensions can have
other fact effect value trees, e.g. $\alpha'$, $\beta'$, $\gamma'$, which are used to calculate the
Customer_Count fact values.
Fig. 2. A star schema of the GroceryStore example

Fig. 3. Effect of dimensions to the fact Dollar_Revenue



4. The Sample Data Generation Process 

The sample data generation is an iterative process, which starts with the modelling of 
the required sample data, and ends with the automatic generation of this data. During 
this process there are many possible iterations that are dependent on the results of the 
sample data generation. If the produced data is not satisfactory or errors occur during
the sample data generation, the design of the sample data has to be changed to
achieve a better result. An actor can repeat the generation of the sample data until
(s)he receives the expected result. Typically, during the first iterations the user
generates smaller amounts of sample data, and after assuring that all requirements are
reasonably fulfilled the user can start with the production of larger amounts of data.

A sample data generation process may be characterized by three functional areas: 

• The build-time functions, concerned with defining, and possibly modelling,
the sample data requirements and definitions

• The run-time functions, concerned with executing the production of sample data 

• The run-time interactions with human users and applications for controlling the 
sample data generation 

Figure 4 shows the stages for a sample data generation and the relationships 
between these main functions. 

Build-time Functions. Build-time functions are used to build all necessary 
definitions for the desired sample data. These definitions cover the full description of 
the sample data being built for the data warehouse. 

Run-time Functions. At the run-time stage the data of the repository with the 
sample data definitions is used to generate the desired sample data. Run-time 
functions provide the functionality to produce sample data that corresponds to their 
definitions. They check the consistency of sample data definitions and ensure that all
constraints are satisfied. 




Fig. 4. Sample Data Generation Process 

Run-time Interaction Functions. The run-time interaction functions act as links 
between the run-time functions and the actors, which manage the sample data 
generation service (SDGS) from the outside. Actors can be either human users or 
applications that are able to control the sample data generation by interactions like 
executing, cancelling or adjusting the sample data generation. Interaction with the 
SDGS is necessary to initiate the generation process, to treat exceptions during the 
sample data production (e.g. by adapting the sample data definitions) or to stop the
generation process. 



5. BEDAWA Tool



Overview 

In the BEDAWA project, we implemented a prototype of the previously mentioned
sample data generation process. The statistical model introduced in section 3 is
implemented in this prototype. Additionally BEDAWA provides functionality to 
define constraints for the generation process. This additional feature makes it easier 









BEDAWA: A Tool for Generating Sample Data for Data Warehouses 89 




Every time sample data is generated for the data warehouse, the following 11 steps
are required (see figure 5). Steps 1-8 belong to the build-time stage. This stage covers
the modelling and the design of the sample data. In this stage all definitions for the 
sample data are completed which are required to generate the sample data 
automatically. The next steps 9-11 belong to the run-time stage and are necessary to 
generate the modelled sample data. All steps are described in the following section in 
more detail. 

The run-time interaction stage covers the interaction with human users and 
applications for controlling the sample data generation. That means, depending on the 
result of the sample data generation in this stage, possible iterations of the presented 
steps can be necessary. If for instance the result of the sample data production
does not fulfil the requirements, the actor will have to go back to the build-time stage
to model the sample data correctly. The actor iterates between these steps until the
produced sample data is satisfactory.
The Use of BEDAWA in the Process of Sample Data Generation:

• Data Source Definitions. Data source definitions provide flexibility in support of 
multiple heterogeneous data sources. They hold the connection information to data 
providers, which can be any supported external data sources (e.g. relational
database, flat file). Data source definitions are used in our model for two purposes.
First, these definitions can be used to define the connection information for 
retrieving pertinent data for the definition of sample data from a data provider. 
Second, a data source definition has to indicate, where the generated sample data 
has to be stored. 

• Dimension Definitions. Dependent on the characteristics of the modelled sample 
data, different dimension structures are required. For generating sample data that is 
used for different purposes on various systems, the BEDAWA tool supports a 
flexible way to define the dimension structures and is also able to automatically 
generate data for dimensions. The hierarchy of a dimension can be defined by 
expected number of levels, and the data type for every level. 

• Distribution Definitions. The distribution definition is used to specify the
distribution of the sample data that simulates real life data. By compiling statistics on
real life data, statisticians can, for example, answer the question, “What percentage of a
product was bought?”. Conversely, the sample data generation process uses
distribution information to distribute the references of the dimension elements in a
fact table. The distribution information is defined by percentages, and has to
be provided for every node of a hierarchy level of a referenced dimension.

• Fact Effect Definitions. Based on the tree representation of the dimension domain, 
the fact effect value of every node of the tree has to be defined. For each relation 
between a dimension and a fact, one or more fact effect value set(s) can be defined. 
The fact effect values defined for all nodes of the tree have to satisfy the condition
(4).

• Fact Definitions. As mentioned, one or more fact effects ($\alpha$, $\beta$, $\gamma$) can be used to
define a fact for determining their relationship. Furthermore, an average mean ($\mu$)
and a random error value ($\varepsilon_{ijk}$) are necessary to specify a fact in this step.

• Fact Table Definitions. The structure of the fact table can be defined by
specifying a list of facts and dimensions. For example, if the fact table Sale2000 is
defined by specifying the fact SaleCount and three dimensions Geography,
Product, and Time, then the structure of the fact table Sale2000 consists of three
foreign key columns, e.g. Geography_ID, Product_ID, and Time_ID, and one fact
column, named SaleCount.

• Constraints. Through the constraint definitions, the actors can decide what the
generated sample data should look like. At any time in the build-time stage,
actors can define constraints for the data of the fact table and boundaries for other
generated data. For instance, in the fact definition step, constraints for the fact can
be defined, e.g., a fact value of this fact must be in a range [x..y], a fact value must
be greater than zero and so forth.
• Fact Table Profile Definitions. A fact table profile definition contains all
information needed to generate a fact table, including 1) a data source definition, in which
the sample data will be stored, 2) a fact table definition, which is used to create a fact table
and to derive a list of fact definitions and dimension definitions (in some cases, the
name of the fact table can be changed in this step), 3) the number of records that
should be generated in the fact table, 4) distribution definitions, which are used to
distribute data for dimension columns in the fact table, and 5) the constraints, which are
used to control the generation process.

• Sample Data Generation. Sample data generation consists of three main steps: 1)
check consistency, 2) execute fact table generation, and 3) check constraints. After
selecting a fact table profile, the generation process begins with checking the
consistency of the sample data specification. In the next step, IDs are first
distributed to the dimension columns in the fact table. Afterwards, fact values are
calculated for every fact column. The last step checks whether the generated
data fulfils the constraints. If not, the generation step must be re-executed
until the generated data meets the requirements of the application.
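
The three generation steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the BEDAWA prototype: the function names, the dictionary-based definitions and the retry limit are hypothetical, and flat effect sets are used instead of full dimension trees.

```python
import random


def distribute_ids(distribution, n_rows, rng):
    """Pick one dimension element per row according to percentage weights."""
    ids = list(distribution)
    weights = [distribution[i] for i in ids]
    return rng.choices(ids, weights=weights, k=n_rows)


def check_consistency(effects, distributions, tol=1e-9):
    """Step 1: effects sum to zero, percentages sum to 100, and key sets agree."""
    for eff, dist in zip(effects, distributions):
        if abs(sum(eff.values())) > tol:
            raise ValueError("fact effect values must sum to zero")
        if abs(sum(dist.values()) - 100.0) > tol:
            raise ValueError("distribution percentages must sum to 100")
        if set(eff) != set(dist):
            raise ValueError("effects and distribution must cover the same elements")


def generate_fact_table(mu, effects, distributions, n_rows, sigma,
                        constraint, seed=0, max_attempts=100):
    """Steps 2 and 3: generate rows, then regenerate until the constraint holds."""
    check_consistency(effects, distributions)
    rng = random.Random(seed)
    for _ in range(max_attempts):
        columns = [distribute_ids(d, n_rows, rng) for d in distributions]
        rows = []
        for keys in zip(*columns):
            y = mu + sum(eff[k] for eff, k in zip(effects, keys))
            rows.append(keys + (y + rng.gauss(0.0, sigma),))
        if all(constraint(row[-1]) for row in rows):
            return rows
    raise RuntimeError("constraints not met after max_attempts generations")
```

Each returned row holds one foreign key per dimension followed by the fact value, mirroring the fact table structure described in the Fact Table Definitions step.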



6. Results 
Fig. 6. The representation of the fact table Grocery (first 18 rows of 10000)

The BEDAWA tool has been used to generate several sets of sample data for
purposes of benchmarking, testing and demonstrating data warehouses and OLAP
systems. In this section we show how to generate sample data for the well-known
GroceryStore example from [11]. The GroceryStore schema is presented in figure 2.

Following the steps in section 5, we generate the GroceryStore fact table with 10000
records; the generated sample data is shown in figure 6. Generally, there is no
limitation on the number of generated records using the BEDAWA tool. This ability
ensures that the tool can be used to define and generate scalable sample data adapted
to the needs of various usage purposes.

7. Conclusion and Future Works 



We introduced the need for a tool to generate sample data for data warehouses.
In the context of generating sample data, we presented the tool within the BEDAWA
project. The BEDAWA tool has a sound statistical foundation by using and
implementing the statistical model of an n-way classification with fixed effects. The
theoretical soundness of the model provides a framework to define relationships
between dimensions and facts in the context of the star schema.

The paper presented the process of generating sample data for data warehouses. 
The process provides abilities to define and to generate sample data. In addition, the 
tool provides functionality to define constraints to check whether the sample data 
meets the application requirements or does not. 

In conclusion, the implemented prototype is able to flexibly and effectively
generate sample data for specific applications. As a result, the generated sample data is
statistically correct and “looks like” real business data.

In the future, we will extend the BEDAWA tool to support complex data structures
for sample data such as the snowflake schema. Further research must pay attention to
extending the adaptability of the statistical model to simulate real life data, that
is, more statistical models are needed to present various relationships in the
generated sample data. Furthermore, we are going to offer the BEDAWA tool as a
web service. For this purpose we extend our prototype with CORBA technology to
allow users unlimited access to the sample data generation engine of the BEDAWA
tool. In addition, we offer the sample data in XML and DDL (Data Definition
Language) to provide a wide application of the generated data.

Abstract. Data warehouses (DW) are an emerging technology to support
high-level decision making by gathering information from several
distributed information sources (ISs) into one materialized repository. In 
dynamic environments such as the web, DWs must be maintained in or- 
der to stay up-to-date. Recent maintenance algorithms tackle this prob- 
lem of DW management under concurrent data updates (DU), whereas 
the EVE system is the first to handle non-concurrent schema changes
(SC) of ISs. However, the concurrency of schema changes by different ISs 
as well as the concurrency of interleaved SC and DU still remain unex- 
plored problems. In this paper, we propose a solution framework called 
DyDa that successfully addresses both problems. The DyDa framework 
detects concurrent SCs by the broken query scheme and conflicting con- 
current DUs by a local timestamp scheme. The two-layered architecture 
of the DyDa framework separates the concerns for concurrent DU and 
concurrent SC handling without imposing any restrictions on the auton- 
omy nor on the concurrent execution of the ISs. This DyDa solution is 
currently being implemented within the EVE data warehousing system. 



Keywords: Data Warehousing, View Maintenance, Data Updates and Schema 
Changes, Concurrency. 



1 Introduction of Data Warehousing 

Data warehouses (DW) are built by gathering information from several
autonomous Information Sources (ISs) and integrating it into one virtual
repository. Data warehousing [6] is important for many applications in network
environments, such as travel services, e-commerce, decision support systems,
and other web related applications. An important consequence of the autonomy
of sources is the fact that those sources may change without being controlled 
from a higher data integration layer. 

1.1 DW Management under Fully Concurrent Environment

Many sources, particularly WWW-based data sources, change not only their 
data, but also their schema or query interface without cooperating with users of 
their data. Even conventional databases typically undergo schema changes due to 
many reasons, including application requirement changes, corrections of schema 
design errors, upgrades, etc. Case studies by [11] for health management
applications and by [5] for different types of business applications reveal that schema
changes are inevitable not only during the development of a project but also
once a project has become operational. Even seemingly stable data sources that
appear to be only undergoing data updates may indirectly cause schema changes
to be propagated to the DW via their wrappers, e.g., SchemaSQL [3]. Therefore,
any data warehouse has to take the possibility of IS schema changes into account. 

Furthermore, since ISs are autonomous, schema changes can occur concurrently
and can be mixed arbitrarily with data updates from other ISs. Hence,
data warehouses built on top of those ISs have to meaningfully handle these 
updates if their views are to survive over a period of time. 

There are three types of tasks related to DW management in distributed 
environments. The most heavily studied area is the incremental maintenance of 
materialized views under distributed ISs. Such view maintenance (VM) algo- 
rithms [12, 1, 16, 15] maintain the extent of the data warehouse whenever a data 
update (DU) occurs at the IS space. The second area called view synchronization 
(VS) [9, 10] is concerned with evolving the view definition itself whenever a
schema change (SC) at one of the ISs causes a view definition to
become undefined. The third area, referred to as view adaptation (VA) [2, 6, 7],
adapts the view extent incrementally after the view definition has been changed 
either directly by the user or indirectly by a view synchronization module. 

Materialized view maintenance (VM) is the only area among those three that 
thus far has given attention to the problem of concurrency of (data) updates at 
ISs [12, 1, 16, 15]. Our recent work on Coop-SDCC [14] is the first to study the 
concurrency problem of both data updates and schema changes in such envi- 
ronments. The Coop-SDCC approach integrates existing algorithms designed to 
address the three DW management subproblems VM, VS and VA into one system
by providing a protocol that all ISs must abide by and that as a consequence
enables them to correctly co-exist and collaborate. In that solution, the SDCC
system locks the ISs to wait for the mediator to accomplish the maintenance of 
the data warehouse before admitting a SC to be executed at that IS and thus 
propagated up to the DW. This solution has however the limitation of requir- 
ing information sources to cooperate by first announcing an intended SC, then 
waiting for the DW to finish servicing any outstanding requests, before being 
permitted to execute the schema update at the IS. 



1.2 Our Approach 

In this paper, we overcome the limitation of this previous solution. We now 
propose the Dynamic Data warehouse (DyDa) framework that can handle fully 
concurrent SCs and DUs without putting any restriction on the timing of when
an SC is allowed to take place nor requiring any cooperation from the ISs. 

Since we drop the requirement of cooperative ISs, new updates (DU or SC)
could occur concurrently at some of the ISs while the query requests issued
for managing the data warehouse are being sent down to different ISs [1, 15]. We
call such updates concurrent updates. A concurrent DU will result in an incorrect
query result returned by an IS [14, 8], whereas a concurrent SC results in a
broken query that cannot be processed by the ISs, i.e., an error message.

In this work, we provide the first complete solution, called DyDa, with the
following contributions:

— We characterize the problem of maintenance under concurrent SCs, which 
we call the broken-query problem. 

— We devise a strategy for the detection of concurrent SCs based on the broken 
query concept, that identifies the causes of the conflict. 

— We analyze selected VM, VS and VA algorithms from the literature to de- 
termine if and how they are affected by concurrent SCs, and thus elicit their 
properties required to function correctly in this dynamic environment. 

— We introduce the overall solution framework DyDa that adopts a two-layered
architecture that separates the concerns for concurrent DU and concurrent 
SC handling without imposing any restrictions on the fully concurrent exe- 
cution of the ISs. 



2 Problem Definition 

Background on Schema and View Definition Changes. We have to first 
understand how DW management algorithms, i.e., view maintenance (VM), view 
synchronization (VS), and view adaptation (VA), work together (see Figure
1). The VS algorithm will update the view definitions affected by schema changes
(SC) at the IS. The output of VS will be a set of view definition changes (VDCs), 
which are used to update the affected view definitions. Then the VA algorithm 
will adapt the extent of the rewritten views based on the set of VDCs. At the 
same time, VM will continuously update the view extent based on data updates 
(DU) that independently may occur at different ISs. 




Fig. 1. Relationship between VM, VS and VA Algorithms 

Definition of the Maintenance-Concurrent Problem. While the concept
of maintenance concurrency has previously been defined for data updates only
[12], we now extend it to incorporate schema changes. In the following we assume
that the views are defined as SPJ queries and that the connection between each IS and
the DW is FIFO.

Figure 2 defines the main notations we use. A sequential number n, unique 
for each update, will be generated by DyDa whenever the IS update message 
reaches the system. 

Definition 1. Let X(n)[j] and Y(m)[i] denote either DUs or SCs on IS[j] and
IS[i] respectively. We say that the update X(n)[j] is maintenance-concurrent
(in short concurrent) with the update Y(m)[i], denoted X(n)[j] ⊢ Y(m)[i], iff:
i) m < n, and ii) X(n)[j] is received at the DW before the answer QR(m)[j] of
update Y(m)[i] is received at the DW.



Notation    Meaning
IS[i]       Information source with subscript i.
X(n)[i]     X is update (SC or DU) from IS[i] at sequence number n.
Q(n)        Query used to handle update X(n)[i].
Q(n)[i]     Sub-query of Q(n) sent to IS[i].
QR(n)[i]    Query result of Q(n)[i].
QR(n)       Query result of Q(n) after re-assembly of all QR(n)[i] for all i.

Fig. 2. Notations and Their Meanings.

Fig. 3. Time Line for a Maintenance Concurrent Data Update.



Figure 3 illustrates the concept of a maintenance-concurrent update defined
in Definition 1 with a time line. Messages only get time stamps assigned
upon their arrival at the DW layer. That means that the maintenance-concurrent
update is defined with respect to the data warehouse layer instead of the IS layer.
Assume we have one DW and two ISs, IS[1] and IS[2]. First, there is a schema change
SC at IS[1]. Then, there is a data update DU at IS[2]. From the figure, we can see
that the SC is received by the DW before the DU, but DU occurs at IS[2] before
the adaptation query Q(1)[2] of SC arrives at IS[2], that is, DU occurs before
the query result QR(1)[2] arrives at the DW. So, here the DU is maintenance
concurrent with SC by Definition 1.
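
The two conditions of Definition 1 can be written as a small predicate. The sketch below is only an illustration under stated assumptions: the Update record, its field names and the numeric time stamps are hypothetical, and the time stamps are assumed to be the local ones assigned on arrival at the DW layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    kind: str          # "DU" or "SC"
    source: int        # index of the IS the update comes from
    seq: int           # sequence number assigned on arrival at the DW layer
    arrived_at: float  # local DW time stamp of the arrival


def is_maintenance_concurrent(x: Update, y: Update, answer_arrived_at: float) -> bool:
    """Check whether update x is maintenance-concurrent with update y.

    answer_arrived_at is the DW time stamp at which the query result
    QR(m)[j] for handling y came back from the IS that issued x.
    """
    later_in_sequence = y.seq < x.seq                        # condition i): m < n
    before_answer = x.arrived_at < answer_arrived_at         # condition ii)
    return later_in_sequence and before_answer


# The scenario of Figure 3: SC from IS[1] first, then DU from IS[2].
sc = Update(kind="SC", source=1, seq=1, arrived_at=1.0)
du = Update(kind="DU", source=2, seq=2, arrived_at=2.0)
```

If QR(1)[2] reaches the DW at time 3.0, the DU arrived earlier and is concurrent with the SC; if the answer had already arrived at time 1.5, it is not.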

There are four types of maintenance-concurrent updates listed in Table 
1 in the order of the easiest to the hardest in terms of handling them. 

Definition 2. A query is called a broken query if it cannot be processed by
an IS because the schema of the IS expected by the query is not consistent with
the actual schema of the IS encountered during query processing.

Type   Meaning                                                  Denoted By
I      A maintenance-concurrent DU happened when handling DU    DUh - DUmac
II     A maintenance-concurrent DU happened when handling SC    SCh - DUmac
III    A maintenance-concurrent SC happened when handling DU    DUh - SCmac
IV     A maintenance-concurrent SC happened when handling SC    SCh - SCmac

DUh (or SCh) denotes the DU (or SC) that is currently being handled by the DyDa system.
DUmac (or SCmac) denotes the DU (or SC) that is a maintenance-concurrent DU (or SC).

Table 1. Four Types of Maintenance-Concurrent Updates.



We distinguish between three types of broken queries based on the type of 
SC causing the problem as well as the characteristics of the IS space available 
for the system to deal with this problem. 

— Type 1: Broken Queries caused by a RenameSC. In this case the data 
is still there but only the interface of this IS relation has changed. We can 
use a name mapping strategy to get to the desired data. 

— Type 2: Broken Query caused by a DropSC with Replacement. 
While the data has been removed from the IS, we are able to find the required
information from an alternate source that holds duplicate data. 

— Type 3: Broken Query caused by DropSC without Replacement. 
The data has really been dropped from the IS, and the system is not able to 
identify an alternate source for the data. 

An AddSC will not result in broken queries, as it does not interfere with the
interpretation of any existing query. A broken query as defined in Definition 2
will be returned by the respective IS as an empty query result with a “broken
query” error message.
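
The case distinction above can be summarized in a few lines. This is a hedged sketch only: the enum, the function name and the string labels for schema change kinds are hypothetical, and deciding whether a replacement source exists is assumed to be done elsewhere.

```python
from enum import Enum
from typing import Optional


class BrokenQueryType(Enum):
    RENAME = 1                    # Type 1: data still there, interface renamed
    DROP_WITH_REPLACEMENT = 2     # Type 2: data dropped, alternate source exists
    DROP_WITHOUT_REPLACEMENT = 3  # Type 3: data dropped, no alternate source


def classify_broken_query(sc_kind: str,
                          replacement_available: bool = False) -> Optional[BrokenQueryType]:
    """Map the schema change that hit a query to one of the three types.

    Returns None for AddSC, which does not break existing queries.
    """
    if sc_kind == "AddSC":
        return None
    if sc_kind == "RenameSC":
        return BrokenQueryType.RENAME
    if sc_kind == "DropSC":
        if replacement_available:
            return BrokenQueryType.DROP_WITH_REPLACEMENT
        return BrokenQueryType.DROP_WITHOUT_REPLACEMENT
    raise ValueError(f"unknown schema change kind: {sc_kind}")
```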



3 The DyDa Framework 

3.1 Overall Solution Architecture 

The DyDa framework is divided into three spaces: DW space, middle space, and 
IS space as described in Figure 4, which depicts the modules of our proposed 
framework and the data flow between them. The DW space houses the extent 
of the DW. It receives queries from the middle space bundled together with the 
data to update the DW. The IS space is composed of ISs and their correspond- 
ing wrappers. Wrappers are responsible for translating queries, returning query 
results, and sending notifications of DUs and SCs of the ISs to the middle space. 

Fig. 4. Architecture of DyDa Framework

Meaning of Symbols Used in Figure 4.

Symbols   Meaning
V         View definition affected by either Schema Change (SC) or Data
          Update (DU).
V’        Evolved view definition of affected view V.
DW        Data warehouse.
ΔV        Incremental view extent of data warehouse, i.e., set of tuples to be
          inserted into or removed from the extent of view V.
GQR       Query result that is returned by QE.
VS-VA     All information VA module requires from VS module for view
          adaptation. It includes: V, V’, Meta-knowledge,
          Synchronization-mapping.
VAQ       View Adaptation Query.
VAQR      View Adaptation Query Result.

The middle space is the integrator of the DyDa framework. It can be divided
into two subspaces. The higher-level subspace is called the DW management
subspace. All the DW management algorithms, like VS, VA and VM, are located
in this subspace. The lower-level subspace is called the query engine
subspace. This subspace is composed of the Query Engine (QE) and its supporting
modules, namely, the Update Message Queue (UMQ) module and the Assign
Time Stamp module. The UMQ module captures all updates received, and the
Assign Time Stamp module is used to assign a time stamp to each incoming
message, i.e., SC, DU and query result, based on its arrival. The two subspaces
effectively correspond to two different levels of concurrency control. The
key idea is that maintenance-concurrent DUs will be handled locally by the
QE module at the lower level of the middle space, so that DW management
algorithms at the upper level, such as VS, VM and VA, are shielded from and will
never be aware of any maintenance-concurrent DUs. While our associated
technical report [13] gives more details on the dynamicity and protocol of the
system, we now discuss the justification of the two-layered concurrency control
in Section 4.



4 Two Levels of Concurrency Control of DyDa 

While the detection of concurrent DUs is based on the local timestamp assigned 
to DUs upon their arrival at the DW [12], the detection of concurrent SCs is 
based on identifying when and why a submitted query is broken and hence 
returned unanswered. 

In the DyDa framework, in order to separate out the handling of concurrent
data updates and schema changes, there are two levels of concurrency control
corresponding to the two subspaces of the middle space. At the lower level of
concurrency control (the query engine subspace), the concurrent DUs as well as
the concurrent SCs of the type RenameSC will be handled by the QE. The
DW management subspace supports the higher level of concurrency control. The
management algorithms (e.g., VM, VS and VA) at that subspace cooperate with
each other in order to handle schema changes of the type DropSC. SCs of the type
AddSC do not render view definitions undefined and hence do not affect VM, VS or
VA. Thus they do not need to be handled by DyDa.

4.1 Low Level Concurrency Control at QE Subspace 

DyDa solves the type I and II maintenance-concurrent problems (see Table
1) at the QE level by the local correction strategy, and part of the type III and IV
maintenance-concurrent problems (see Table 1) by a local name mapping
strategy.



Using Local Correction to Handle Maintenance-Concurrent DUs. All
queries from the DW modules down to the IS space will first go through the
QE module (Figure 4). This includes incremental view maintenance queries,
view adaptation queries, and view recomputation queries. Given that all three
query types are “extent-related queries”, the query engine will use the local
correction (LC) algorithm described in [14] to fix all side effects of
concurrent DUs on these queries before passing the corresponding query results
up to the next layer of the system. This makes the concurrent DUs
transparent to all modules in the DW management subspace. By using the local
correction algorithm, there is no possibility of an infinite wait
due to the recursive correction of queries, as is a problem in the Strobe solution
[16]. Details of the local correction scheme we adopt for DyDa and its proof of
correctness are beyond the scope (and space limitations) of this paper, but can
be found in [12].



Using Name Mapping to Handle Maintenance-Concurrent RenameSCs.
In addition to the above service, features based on a temporary name mapping
table have been added to the query engine of DyDa in order to handle concurrent
SCs of type RenameSC. Whenever a query result is received, the QE will
check whether a concurrent RenameSC has been received in the UMQ. If there
is no concurrent RenameSC, the QE will directly return the result. If there is a
concurrent RenameSC that breaks the query, the QE will rewrite the query
based on the RenameSC and resend it until it succeeds, keeping the renaming
transparent to the higher layer. A more comprehensive discussion of handling
RenameSCs at the QE can be found in [13].

If more than one SC is found in the UMQ, and at least one of them is not a
RenameSC (i.e., at least one DropSC exists), then the QE cannot handle the
broken query anymore. It will report an error message to the data warehouse
maintenance subspace. If the modified query based on the RenameSC breaks
again due to another RenameSC in the UMQ, the QE module will modify the
query and resubmit it until either the query succeeds, or a DropSC is encountered
in the UMQ. In the highly unlikely case that the ISs were to continuously issue
new RenameSCs, then the QE would have to endure a long processing time.
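
The retry behaviour described above can be sketched as a loop over a temporary name mapping table. The sketch is not the DyDa query engine: the exception classes, the function execute_with_name_mapping, the tuple encoding of schema changes and the representation of a query as a list of relation names are all assumptions made for illustration, and the UMQ is modelled as a plain list that is inspected each time a query comes back broken.

```python
from typing import Callable, Dict, List, Tuple


class BrokenQueryError(Exception):
    """Hypothetical error raised when an IS rejects a query as broken."""


class DropSCEncountered(Exception):
    """Hypothetical signal that the QE must escalate to the upper subspace."""


# (kind, old_name, new_name); new_name is ignored for kind "drop".
SchemaChange = Tuple[str, str, str]


def execute_with_name_mapping(names: List[str],
                              send: Callable[[List[str]], object],
                              umq: List[SchemaChange]) -> object:
    """Send a query, renaming and resending it while only RenameSCs interfere."""
    mapping: Dict[str, str] = {}  # temporary name mapping table
    seen = 0                      # number of UMQ entries already applied
    while True:
        current = [mapping.get(n, n) for n in names]
        try:
            return send(current)
        except BrokenQueryError:
            pending = umq[seen:]
            seen = len(umq)
            if any(kind == "drop" for kind, _, _ in pending):
                # The QE cannot repair this locally; report upwards.
                raise DropSCEncountered(current)
            renames = [(old, new) for kind, old, new in pending if kind == "rename"]
            if not renames:
                raise  # broken for a reason the mapping table cannot explain
            for old, new in renames:
                # Follow rename chains: names already mapped to old now map to new.
                for original, mapped in list(mapping.items()):
                    if mapped == old:
                        mapping[original] = new
                mapping.setdefault(old, new)
```

Because the loop only terminates on success, on a DropSC, or when no new RenameSC explains the failure, a steady stream of new RenameSCs would keep it running, which matches the long processing time noted in the text.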

4.2 High Level Concurrency Control at DW Management Subspace 

Finally, the VS, VM and VA modules in the DW management subspace
need to handle the concurrent DropSC problem. VS is the least affected by
concurrent DropSCs, since it will not send queries to the QE, as shown below. The
VM module is affected by concurrent DropSCs, but doesn’t have the capability to
handle them, as shown below. The VA module is the module designated to handle
concurrent DropSCs, as shown in [13].



The VS Module and Maintenance-Concurrent SCs. From Definition 1, 
we know that VS will never have any trouble with concurrent SCs, because it 
will not send any queries down to the IS space. However, the VS module needs to 
provide information to the VA module to help VA to adapt the view extent under 
such concurrent SCs (Figure 4). From the state diagram (Figure 1) in Section 
2, we can see that for every SC, the VS module will be called to generate a new 
version of the view definition to account for this SC. 

From the point of view of VS, all SCs happen sequentially. If two SCs come
to the middle space at the same time, they will be assigned a random handling
order.¹ In a distributed environment as we are assuming, there is no convenient
way to determine when the two SCs happened at the IS space relative to each
other (unless they both come from the same IS). Moreover, they may indeed have
truly occurred at the same time. The issue of which of the SCs happened first
for two different SCs from two autonomous ISs does not affect the correctness
of the VS.

As further detailed in [13], after we apply VS, the VA module will know how
each view definition evolved and thus is capable of generating the corresponding
view adaptation queries for it. If VS drops a view as being no longer salvageable,
i.e., empty extent, then VA doesn’t need to adapt it.



The VM Module and Maintenance-Concurrent SCs. If VM encounters a
concurrent SC, it recognizes that by the fact that it will receive a broken query
passed up from the QE module. Because the VM algorithms in the literature
haven’t considered thus far how to handle broken queries, we put the responsibility
of this handling on the VA module, which needs to be redesigned anyway.
Thus in our system existing VM algorithms from the literature are kept intact.
Hence the VM algorithm simply stops (abnormally) and sends the DU that it
is currently handling to the VA module. That DU will then be handled later
by the VA algorithm when it is adapting the extent of the view, as discussed
in detail in [13]. Given that we isolate the VM module from the handling of
maintenance-concurrent SCs, all the VM algorithms, e.g., PSWEEP [12],
SWEEP [1], Strobe [16], and ECA [15], could be used in DyDa.

¹ If two SCs come to the middle space of the DyDa framework at the same time, we will
pick a preferred order of handling the SCs based on a quality-cost model described
in [4]. The decision of the handling order of a set of SCs is out of the scope of this
paper.

4.3 Effect of Maintenance-Concurrent Updates on Existing DW 
Management Algorithms 

In the DyDa framework, we decide to let the query engine (QE) module as ex- 
plained in Section 3 fix the problem of any maintenance-concurrent DU be- 
fore the query results reach the DW management modules. So, the maintenance- 
concurrent DUs have no effect on the three modules VA, VM and VS. How- 
ever, the three modules have a different degree of awareness of maintenance- 
concurrent SCs, namely: 

— There is no concept of maintenance-concurrent SCs for the VS module, 
because the VS module never sends any query down to the IS space. 

— While the VM module will send queries down to the IS space for view main- 
tenance, it assumes the view definition that it is maintaining will not be 
changed. So if a maintenance query is broken by a maintenance-concurrent 
SC, the VM module has to be reset by the DyDa system so as to work with
the newly updated view definition that has been generated by VS to take 
care of that maintenance-concurrent SC. VM itself however can function 
unchanged in our environment. 

— The VA module also sends down queries to the IS space to adapt the view ex- 
tent. If a view adaptation query is broken by a maintenance-concurrent 
SC, then VA needs to handle this in its adaptation process. Hence, a com- 
pletely new VA algorithm has been designed for DyDa [13]. 
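
The division of labour between VM and VA under a maintenance-concurrent SC can be sketched as follows. This is an illustrative assumption, not the DyDa protocol: the function maintain_or_defer, the callable vm_step standing in for an unchanged VM algorithm, and the list va_backlog standing in for the hand-off to the VA module are all hypothetical.

```python
from typing import Callable, List, Optional


class BrokenQueryError(Exception):
    """Hypothetical error passed up from the QE when a query is broken."""


def maintain_or_defer(du: str,
                      vm_step: Callable[[str], object],
                      va_backlog: List[str]) -> Optional[object]:
    """Run an unmodified VM step; on a broken query, hand the DU over to VA.

    Returns the computed view delta, or None if the DU was deferred.
    """
    try:
        return vm_step(du)
    except BrokenQueryError:
        # VM stops abnormally; the DU is handled later during view adaptation.
        va_backlog.append(du)
        return None
```

The VM algorithm itself is never changed here; only the caller reacts to the broken query, which is what allows existing VM algorithms to be plugged in.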

5 Conclusions 

In this paper, we first identify the broken query problem of DW management 
under concurrent DUs and SCs. Then, we propose the DyDa solution frame- 
work that solves the problem using a two layered architecture as presented in 
the paper. To our knowledge, our work is the first to address the data warehouse 
maintenance problem under fully concurrent DUs and SCs of ISs. DyDa over- 
comes the limitation of the only previous approach by dropping its restrictive 
assumption of cooperation of the ISs [14]. We are in the process of implementing
this DyDa data warehouse solution. In the future, we plan to run experimental
studies to assess performance aspects of this solution.

Acknowledgments. We would like to thank all DSRG members, in partic- 

Abstract. The maintenance of data warehouses (DWs) is becoming an
increasingly important topic due to the growing use, derivation and in- 
tegration of digital information. Most previous work has dealt with one 
centralized data warehouse only. In this paper, we now focus on envi- 
ronments with multiple DWs that are possibly derived from other DWs. 
In such a large-scale environment, data updates from base sources may 
arrive in individual data warehouses in different orders, thus resulting in 
inconsistent data warehouse extents. We propose to address this problem 
by employing a registry agent responsible for establishing one unique or- 
der for the propagation of updates from the base sources to the DWs. 
With this solution, individual DW managers can still maintain their 
respective extents autonomously and independently from each other, 
thus allowing them to apply any existing incremental maintenance al- 
gorithm from the literature. We demonstrate that this registry-based 
coordination approach (RyCo) indeed achieves consistency across all 
DWs. 



Keywords: Distributed Data Warehousing, View Maintenance, Registry. 

1 Introduction 

Data warehousing [14, 6, 7, 15, 11, 10] (DW) is a popular technology to integrate 
data from heterogeneous information sources (ISs) in order to provide data to, 
for example, decision support or data mining applications [2]. Once a DW is
established, the problem of keeping it consistent with the underlying ISs under
updates remains a critical issue. It is popular to maintain the DW incrementally
[1, 9, 8] instead of recomputing the whole extent of the DW after each IS update,
due to the large size of DWs and the enormous overhead associated with the DW
loading process. The majority of such view maintenance algorithms [1, 17, 16]
as of now are based on a centralized DW system in which materialized views are 
stored in a single site even if bases may be distributed. 

Given the growth of digital on-line information and the distributed nature 
of sources found on the Internet, we can expect large collections of interrelated 
derived repositories instead of just one individual DW. Such a distributed en- 
vironment composed of data warehouses defined on top of each other is also 
emerging as a design solution for a growing set of applications [13]. In this pa- 
per, we focus on the view maintenance in such distributed environments. 

1.1 Related Work 

Zhuge et al. [18] proposed an architecture for handling multiple-view consistency
for several views specified in a single data warehouse. Stanoi et al. [12] proposed
a weak consistency maintenance approach that processes the cumulative effect
of a batch of updates. In their later work [13], they proposed an incremental
algorithm for maintaining distributed multi-view DWs. Their approach requires
storage for a stable state with a table of changes. A stable state is a snapshot of
a safe state in the view’s history such that the view will not need to answer
update queries based on a state prior to this safe state. The table of changes
includes all the updates following the safe state of the actual materialized view.
There is a dependency list appended to all entries in the table of changes [13].
A materialized view is refreshed only when the new state is considered safe, i.e.,
when the respective updates are in sequence, and have been integrated by all
the corresponding direct descendents. Hence, DWs always need to wait for their
corresponding direct descendents. In our approach, we don’t need any DW to
keep a safe state table. In fact, our solution is able to eliminate the possibility
of an infinite wait. Our approach has good performance for both flat and tall
topologies of DW dependencies and hence is scalable.

1.2 Our RyCo Approach 

In this paper, we propose a registry-based coordination solution strategy that 
is able to coordinate the maintenance of multiple data warehouses in a non- 
intrusive manner by establishing a unique order among all base update notifica- 
tions for the environment. DW managers exploit this order synchronization to 
determine how and when to process incoming update notifications and to com- 
mit their respective extent updates, instead of blindly making use of the order 
in which the update messages are received from the other DWs. We demon- 
strate that this method, which we call RyCo for registry-based coordination for 
distributed data warehouse maintenance, will indeed achieve consistency across 
all data warehouses with little overhead. In summary, RyCo has the following 
advantages: 

— First, each DW is maintained by a separate DW manager and can be updated 
independently from other DWs. 

— Second, any DW maintenance algorithm for a central DW can be adopted
from the literature and plugged into our system. 

— Third, all DWs are guaranteed to be maintained consistent with each other 
and the ISs at all times. 

— Fourth, unlike previous work [13], our approach avoids any requirement for 
a safe time before one can refresh a DW. 
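
To make the idea of one unique order more concrete, the sketch below shows one possible reading of it: a registry hands out a global sequence number per base update, and each DW manager buffers notifications until the next number in sequence is available. This is only an assumption for illustration and not the RyCo algorithm, which is presented later in the paper; the classes Registry and DWManager and their methods are hypothetical.

```python
import heapq
import itertools
from typing import List, Tuple


class Registry:
    """Hands out one global, gap-free order number per base update."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)

    def register(self, du_id: str) -> int:
        return next(self._counter)


class DWManager:
    """Applies update notifications strictly in registry order."""

    def __init__(self) -> None:
        self._next = 1                           # next order number to apply
        self._buffer: List[Tuple[int, str]] = []  # min-heap of early arrivals
        self.applied: List[str] = []

    def receive(self, order: int, du_id: str) -> None:
        heapq.heappush(self._buffer, (order, du_id))
        # Drain the buffer as long as the next expected number is present.
        while self._buffer and self._buffer[0][0] == self._next:
            _, ready = heapq.heappop(self._buffer)
            self.applied.append(ready)  # a real manager would run its VM algorithm here
            self._next += 1
```

Two managers that receive the same notifications in different arrival orders end up applying them in the same sequence, independent of message timing.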

Outline. The rest of the paper is organized as follows. The system definitions, 
assumptions and view consistency levels are presented in Section 2. Section 3 
presents the architecture and algorithm of the registry-based solution. Finally, 
we conclude in Section 4. 

2 Distributed Data Warehousing Environments 

2.1 View Dependency Graph 

A distributed data warehousing environment (DDWE) is composed of multiple
possibly interrelated DWs. Individual DWs are independent and possibly run on
different sites. Each DW has its own views. Views may be defined on views of
its own DW, on other views, on base relations in ISs, or on a combination of the
above. For simplicity, we assume each DW only has one view and each IS has
one base relation. We can use the MRE wrapper proposed in [4] to relax the
latter assumption.

Definition 1. A view V is defined as $V = S_1 \bowtie S_2 \bowtie \dots \bowtie S_n$, where $S_i$
($1 \le i \le n$) could be either a base relation or a view. We say $S_i$ is a parent of the
view V and V is the direct descendent of $S_i$. The bases $B_1, \dots, B_m$, from
which V is ultimately derived, i.e., all ancestors of V that are bases, are called
base-ancestors of V. If a base B is a base-ancestor of both a view $V_1$ and a
view $V_2$, then we say B is a common base-ancestor of $V_1$ and $V_2$.

A view-dependency graph [3] represents the hierarchical relationships among 
the views in DWs and bases in ISs. The dependency graph shows how a view is 
derived from bases and/or other views. 

Definition 2. The view dependency graph of the view V in a DDWE is
a graph $G_V = (N, E)$ with N the set of bases or views that are ancestors of V
including V itself and E the set of directed edges $E_{ij} = (N_i, N_j)$ for each $N_j$
that is a direct descendent of $N_i$, where $N_i, N_j \in N$. All nodes $N_i$ are labeled by
the name of the bases or views that they represent.

Example 1. Suppose there are two base relations $B_1$, $B_2$ in two ISs and three
views $V_0$, $V_1$ and $V_2$ at three DWs that reside on different sites, defined as
$V_0 = V_1 \bowtie V_2$, $V_1 = B_1 \bowtie B_2$ and $V_2 = B_1 \bowtie B_2$.

Figure 1 depicts the view dependency graph for the views. For example, views
$V_1$ and $V_2$ are parents of the view $V_0$. $B_1$ and $B_2$ are ancestors of $V_0$. $V_0$ is the
direct descendent of $V_1$ and $V_2$. The base-ancestors of $V_0$ (as well as $V_1$) are
$B_1$ and $B_2$, while $B_1$ and $B_2$ are also common base-ancestors of $V_1$ and $V_2$.
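
The notions of Definitions 1 and 2 can be checked on Example 1 with a few lines of code. The sketch below is illustrative only: the dictionary PARENTS and the three helper functions are hypothetical names, and a node without an entry in the dictionary is assumed to be a base relation.

```python
from typing import Dict, List, Set, Tuple

# parents[x] lists the relations that view x is defined on (Example 1).
PARENTS: Dict[str, List[str]] = {
    "V0": ["V1", "V2"],
    "V1": ["B1", "B2"],
    "V2": ["B1", "B2"],
}


def base_ancestors(view: str, parents: Dict[str, List[str]]) -> Set[str]:
    """All ancestors of view that are bases (nodes without parents)."""
    result: Set[str] = set()
    stack = list(parents.get(view, []))
    while stack:
        node = stack.pop()
        if node in parents:
            stack.extend(parents[node])  # a view: keep climbing
        else:
            result.add(node)             # a base relation
    return result


def common_base_ancestors(v1: str, v2: str, parents: Dict[str, List[str]]) -> Set[str]:
    return base_ancestors(v1, parents) & base_ancestors(v2, parents)


def dependency_edges(view: str, parents: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Directed edges (N_i, N_j) where N_j is a direct descendent of N_i."""
    edges: Set[Tuple[str, str]] = set()
    seen: Set[str] = set()
    stack = [view]
    while stack:
        child = stack.pop()
        if child in seen:
            continue
        seen.add(child)
        for parent in parents.get(child, []):
            edges.add((parent, child))
            stack.append(parent)
    return edges
```

For Example 1, base_ancestors("V0", PARENTS) yields the two bases B1 and B2, and the edge set of the graph for V0 contains six edges, two into V0 and two into each of V1 and V2.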

[Figure 1: view dependency graph of Example 1; labels: Descendent, Ancestor]

Fig. 1. A Distributed DW System Example



2.2 Base Consistency and View Consistency 

Consider a base B that executes a sequence of atomic data updates. These
changes result in a sequence of states. We use the notation $St_m(B)$ for the m’th
state in this sequence.

Definition 3. Let a view V be defined by Definition 1. V is said to be base
consistent if each DW state of V, $St_i(V)$, corresponds to one single state $St_j(B_k)$
for each base $B_k$ with k from 1 to m.

Based on Definition 3, a view is base consistent if at all times its extent
reflects at most one real state of each IS. V cannot reflect two different states of
a base $B_j$.

Definition 4. Two views V_1 and V_2 are view consistent with each other if
V_1 and V_2 are both base consistent and, after committing each data update (ΔB)
from any of the common bases B at their respective sites, states St_i(V_1) and
St_j(V_2) correspond to the same state St_m(B) of that common base B.

2.3 View Maintenance in Distributed Data Warehouse Systems

To distinguish view updates caused by different data updates, we use a data
update identifier in the rest of the paper, in short DU-id, to identify DUs from
different ISs. For simplicity, we use an integer to denote a DU-id, with the first
digit of the DU-id the same as the IS index number ^ and the rest of the DU-id the
index of DUs from that IS. For example, ΔS_i/kj denotes the ΔS_i calculated for
the j-th update from B_k.

^ If there were a lot of bases, we could use more digits to represent the base index number.

Assume the view V is defined as in Definition 1 and the views S_i (for 1 ≤ i ≤ n)
have multiple common bases B_k. If there is any base data update from such
a common base B_k, denoted as ΔB_k/kj, the views S_1, ..., S_n and V need to be
maintained view consistent with each other. Then all ΔS_i/kj (i from 1 to n)
are calculated, where some of the ΔS_i/kj could be null if B_k is not a base of the
view S_i, and sent to the view V. If all ΔS_i/kj only contain the effect of the data
update ΔB_k/kj for the base B_k, we can calculate the new view extent V as
follows:

\[
V + \Delta V = (S_1 + \Delta S_1/kj) \bowtie \dots \bowtie (S_i + \Delta S_i/kj) \bowtie \dots \bowtie (S_n + \Delta S_n/kj) \tag{1}
\]

Hence, the ΔV expression is shown as follows:

\[
\begin{aligned}
\Delta V ={}& \Delta S_1/kj \bowtie S_2 \bowtie \dots \bowtie S_i \bowtie \dots \bowtie S_n \\
&+ (S_1 + \Delta S_1/kj) \bowtie \Delta S_2/kj \bowtie S_3 \bowtie \dots \bowtie S_i \bowtie \dots \bowtie S_n + \dots \\
&+ (S_1 + \Delta S_1/kj) \bowtie \dots \bowtie (S_{i-1} + \Delta S_{i-1}/kj) \bowtie \Delta S_i/kj \bowtie S_{i+1} \bowtie \dots \bowtie S_n + \dots \\
&+ (S_1 + \Delta S_1/kj) \bowtie \dots \bowtie (S_i + \Delta S_i/kj) \bowtie \dots \bowtie (S_{n-1} + \Delta S_{n-1}/kj) \bowtie \Delta S_n/kj
\end{aligned} \tag{2}
\]
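
For intuition, the smallest case n = 2 can be expanded by hand. This is only a sketch, assuming that the join distributes over the + operation on extents (as it does for SPJ views); the DU-id suffix /kj is dropped for readability.

```latex
\[
\begin{aligned}
V + \Delta V &= (S_1 + \Delta S_1) \bowtie (S_2 + \Delta S_2) \\
             &= (S_1 + \Delta S_1) \bowtie S_2 + (S_1 + \Delta S_1) \bowtie \Delta S_2 \\
             &= S_1 \bowtie S_2 + \Delta S_1 \bowtie S_2 + (S_1 + \Delta S_1) \bowtie \Delta S_2 \\
\Delta V     &= \Delta S_1 \bowtie S_2 + (S_1 + \Delta S_1) \bowtie \Delta S_2
\end{aligned}
\]
```

The last line is the first and the last term of the general ΔV expression for n = 2: each term uses the already updated extents to the left of the delta and the old extents to its right.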

However, we note that multiple data updates may have happened concurrently
at different bases. These DUs may arrive at different views in a different
order. Some of the ΔS_i/kj may not only contain the effect of ΔB_k/kj but may
also incorporate the effect of some other DUs. In other words, S_i and S_j
(i ≠ j) are not guaranteed to be view consistent with each other.
Hence, view V derived from S_i (1 ≤ i ≤ n) is not base consistent.

To maintain view V base consistent by Definition 4, we need to update the
view V based on the update messages from all its parents S_1, S_2, ..., S_n that
reflect the same state of the same base. That is, all the ΔS_i/kj (i from 1 to n)
must reflect the same states of all the bases B_1, ..., B_m. We cannot ensure that
S_i and S_j (i ≠ j) always reflect the same base extents because S_i and S_j belong
to different DWs. But to maintain a view V correctly, we first need to ensure that
all its parents S_i are view consistent with each other.



3 The Registry-Based Coordination for Distributed DW 
Maintenance 

3.1 View Maintenance Order and Consistency 

If DWs are maintained independently from other DWs, then they may not be 
consistent with each other. It is difficult to maintain a view that is defined on 
top of already inconsistent views to be base consistent. 

Definition 5. The order in which a DW receives data updates is called
receive-message-order, and the order in which a DW updates its extent is called
update-message-order.

Lemma 1. Given a set of views V_i that are view consistent with each other,
any descendent view V defined directly on top of these consistent views V_i
can be maintained as base consistent and view consistent with V_i.

DWs could have a different receive-message-order (RMO) and no knowledge
of the receive-message-order of other DWs. If DWs were to be maintained not
based on their respective receive-message-order but rather on the same enforced
update-message-order (UMO), then the DWs would be updated and would
reflect base changes in the same order. Hence DWs are consistent with each other
and thus are base consistent.

To generate such a unique update-message-order, the key idea is to use the 
registry agent. All DWs and bases are registered with the registry. The registry 
generates a unique update-message-order and sends it to all DWs in the system. 
All the DW extents are updated by this update-message-order and hence all 
DWs are then consistent with each other. 

3.2 System Architecture of RyCo Approach 

A registry-based coordination (RyCo) framework for distributed DW maintenance
is composed of three types of components, as depicted in Figure 2. First,
the registry agent is used to generate a unique update order. Second, one wrapper
at each base sends DU-ids and data updates to related views. It also processes
queries sent by its direct descendents. Third, the mediator of each data warehouse
maintains the data warehouse based on the updates in the order decided by the
registry. As Figure 2 shows, there is one single registry that connects to multiple
base wrappers and mediators of different DWs.




Fig. 2. Overall Architecture of RyCo System 



Figure 3 shows the simple structure of the registry. The registry keeps the
information of registered DWs in the DW list. It can further add or remove a
DW from its DW list when a DW is added to or removed from the DDWE. It receives
DU-ids from all bases, orders those IDs based on their receive order and then
forwards those IDs in that unique order to all DWs in the system. The bases,
as shown in Figure 4, only send a DU-id to the registry, as there is no need to
send real update messages. A base will still send data update messages with
DU-ids to all its direct descendent DWs as before and also answer all possible
queries from its direct descendents.
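
The registry behaviour described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the class and method names (Registry, DataWarehouse, add_dw, remove_dw, receive_du_id) are assumptions, and message transport is replaced by direct method calls.

```python
from collections import deque


class DataWarehouse:
    """Stand-in for a DW mediator; only its update-message-queue is modeled."""

    def __init__(self, name):
        self.name = name
        self.umq = deque()  # DU-ids in the order dictated by the registry


class Registry:
    """Orders DU-ids by arrival and forwards them to every registered DW."""

    def __init__(self):
        self.dw_list = []  # registered DWs

    def add_dw(self, dw):
        self.dw_list.append(dw)

    def remove_dw(self, dw):
        self.dw_list.remove(dw)

    def receive_du_id(self, du_id):
        # Forwarding immediately preserves the registry's receive order,
        # so every DW sees the same update-message-order.
        for dw in self.dw_list:
            dw.umq.append(du_id)


registry = Registry()
v1, v2 = DataWarehouse("V1"), DataWarehouse("V2")
registry.add_dw(v1)
registry.add_dw(v2)
for du_id in (11, 21):  # DU-ids as they reach the registry
    registry.receive_du_id(du_id)
```

Because only DU-ids pass through the registry, its load is independent of the size of the data updates themselves.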

In Figure 5, a DW mediator consists of the VM (View Manager) and the
Query Processor. The VM has the same function as in centralized DW systems.
The VM processes one update at a time and updates the DW view to be
consistent with the base after this update. Any incremental view maintenance
algorithm from the literature, e.g., SWEEP [1], could be employed as VM here
with the minor modification that the new extent is computed in the order given
by the update-message-order, and not in the receive-message-order in which the
updates are actually propagated upward through the dependency graph.

110 L. Ding, X. Zhang, and E.A. Rundensteiner

Fig. 3. Structure of Registry

Fig. 4. Structure of Base Wrapper

[Figure labels: To Descendent Views; Query; QR; DU from parent; DU-id from Registry]

Fig. 5. Structure of DW Mediator

A DW mediator in the system has two queues, namely the receive-message-queue
(RMQ) and the update-message-queue (UMQ). The RMQ buffers update
messages received from its parents in the receive-message-order. The UMQ is
responsible for buffering ordered DU-id messages from the registry. Views are
maintained and updated in the order of the UMQ as follows. First, the VM removes
the next DU-id from the head of the UMQ and checks whether the DU-id is
related to this view. Because the registry sends all DU-ids to all the views,
some of them may not be related to this view. If the DU-id is not related, the VM
sends an empty update ^ with the same DU-id to all direct descendent DWs.
Otherwise, the VM waits for all updates with this DU-id from all its parents,
and then incrementally calculates ΔV for those data updates using Equation
2. Then, the VM updates its local DW with ΔV. At last, the mediator sends
the update with the same DU-id to all direct descendent DWs.

^ That means no update for this specific base DU.
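
The maintenance loop of the mediator can be illustrated as follows. This is a hedged sketch, not the RyCo pseudo-code from the technical report: the Mediator class, its method names, and the compute_delta callback are assumptions. Instead of blocking, step() returns False while the head of the UMQ still waits for parent updates, and a DU-id is treated as related when its first digit is the index of one of the view's base-ancestors.

```python
from collections import deque


class Mediator:
    """Hypothetical DW mediator that applies updates in UMQ order."""

    def __init__(self, name, parents, base_indices, compute_delta):
        self.name = name
        self.parents = list(parents)           # names of direct parents
        self.base_indices = set(base_indices)  # IS indices of base-ancestors
        self.compute_delta = compute_delta     # stands in for Equation 2
        self.umq = deque()                     # DU-ids in registry order
        self.rmq = {}                          # DU-id -> {parent: update}
        self.applied = []                      # (DU-id, delta) as applied
        self.outbox = []                       # messages for descendent DWs

    def on_registry(self, du_id):
        self.umq.append(du_id)

    def on_parent_update(self, parent, du_id, delta):
        self.rmq.setdefault(du_id, {})[parent] = delta

    def step(self):
        """Process the head of the UMQ if possible; return True on progress."""
        if not self.umq:
            return False
        du_id = self.umq[0]
        if int(str(du_id)[0]) not in self.base_indices:
            self.umq.popleft()
            self.outbox.append((du_id, None))  # empty update
            return True
        received = self.rmq.get(du_id, {})
        if set(received) != set(self.parents):
            return False                       # still waiting for a parent
        self.umq.popleft()
        delta = self.compute_delta(du_id, self.rmq.pop(du_id))
        self.applied.append((du_id, delta))
        self.outbox.append((du_id, delta))
        return True


# V0 has parents V1 and V2; their messages arrive out of order.
v0 = Mediator("V0", ["V1", "V2"], {1, 2},
              lambda du_id, msgs: sorted(msgs.values()))
for du_id in (11, 21):
    v0.on_registry(du_id)
v0.on_parent_update("V2", 21, "dV2/21")
v0.on_parent_update("V2", 11, "dV2/11")
v0.on_parent_update("V1", 11, "dV1/11")
v0.on_parent_update("V1", 21, "dV1/21")
while v0.step():
    pass
```

Although the update for DU-id 21 from V2 arrives first, V0 applies DU-id 11 before 21, because the order is taken from the UMQ and not from the arrival order in the RMQ.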

The Query Processor component is used to process any query sent from its 
children. Its functionality is similar to an IS query processor in centralized DW 
systems. 

We can observe that, using the registry, all DWs in our distributed data 
warehousing environment incorporate the DUs in the same order. Hence they 
are consistent with each other and all views are base consistent. More details on 
the RyCo solution, including formal theorems and some cost evaluation, can be 
found in our technical report [5]. 




Fig. 6. RyCo View Maintenance Example 



3.3 Illustrating Example of RyCo Process 

The example shows how the registry is used to enable view maintenance in 
distributed data warehousing environments. The pseudo-code of the RyCo algo- 
rithm is given in the technical report [5]. 

Example 2. We use the view definition in Example 1. Assume there is a data
update at B_1 denoted by ΔB_1/11 and a data update at B_2 denoted by ΔB_2/21.
These two DUs are then sent to views V_1 and V_2. Assume that data updates
ΔB_1/11 and ΔB_2/21 arrive at the views V_1 and V_2 in different orders.

Without registry, view maintenance for V_1 computes and updates V_1 with
ΔV_1/11 for ΔB_1/11 and then ΔV_1/21 for ΔB_2/21 based on the RMO. However,
the view manager of V_2 computes and updates V_2 in a reverse order, i.e., ΔV_2/21
for ΔB_2/21 first and then ΔV_2/11 for ΔB_1/11. V_1 sends its update messages
ΔV_1/11, ΔV_1/21 and V_2 sends ΔV_2/21 and ΔV_2/11 to V_0. We notice that ΔV_1/11
reflects the new state of B_1 but not B_2. But ΔV_2/11 reflects both the new state
of B_1 and B_2. When V_0 computes its new extent to reflect the update from
base B_1 based on the update messages ΔV_1/11 and ΔV_2/11, its new extent
incorporates aspects of two different states of B_2, i.e., the B_2 before and after
update ΔB_2/21. Hence, V_0 is inconsistent with the base.

With our registry agent, we can solve this problem easily. Assume that the
registry receives DU-ids in the order of 11 and 21. Then the registry forwards this
unique update-message-order to all views, namely V_1, V_2 and V_0. The new view
extents of V_1 and V_2 are calculated and updated in the update-message-order.
Then ΔV_1/11, ΔV_1/21, ΔV_2/11, and ΔV_2/21 are sent to the view V_0 in the order
of 11, 21. ΔV_1/11 and ΔV_2/11 only have the effect of the data update ΔB_1/11.
ΔV_1/21 and ΔV_2/21 have the effect of both ΔB_1/11 and ΔB_2/21. The view V_0
is then updated in the same update-message-order. That is, it first incorporates
the effect of ΔB_1/11 based on ΔV_1/11 and ΔV_2/11, then incorporates the
effect of ΔB_2/21 based on ΔV_1/21 and ΔV_2/21. Because views V_1, V_2 and V_0 are
updated by the same update-message-order generated by the registry, the views
are consistent with each other and thus also are base consistent.

4 Conclusion 

In this paper, we propose the RyCo approach for consistent view maintenance
in distributed data warehouse environments. All the views in our system are
maintained and updated independently according to the notification order from
the registry. Our registry-based solution does not need any safe time to refresh
a materialized view, as required by the only alternative solution in the literature
thus far [13].

In the future, to make the system more scalable in terms of the number of 
bases and views in the system, we plan to optimize RyCo by dividing the system 
into different clustered groups, called DW groups (DWG) [5]. Each group is 
equipped with its own dedicated registry. Such a distributed registry approach 
could still keep the DWs consistent with each other while we reduce the unrelated 
and empty messages sent by the registry and handled in each DW group.

Abstract. The problem of incremental view maintenance in materialized
data warehouses has been studied extensively for relational
select-project-join (SPJ) views. Many new data sources, however, are highly
irregular and views often perform complex restructuring operations.

This paper describes WHAX (Warehouse Architecture for XML), an ar- 
chitecture for defining and maintaining views over hierarchical semistruc- 
tured data sources with key constraints. The WHAX model is a variant 
of the deterministic model [5], but is more reminiscent of XML. The 
view definition language is a variation of XML-QL and supports selec- 
tions, joins, and important restructuring operations such as regrouping 
and aggregation. The incremental maintenance is based on the notion of 
multi-linearity and generalizes well-known techniques from SPJ-views. 

1 Introduction 

XML has become an important standard for the representation and exchange 
of data over the Internet. As an instance of semi-structured data [1], which was 
initially proposed as a methodology in the Tsimmis project [7], it can also be 
used for data integration [8]. That is, data sources are represented in XML; 
transformations and integration are then expressed in one of several XML query 
languages that have been proposed (e.g. [12,9,8]) to create an XML view. 

Once the view has been materialized, it must be maintained as updates to
the underlying data sources are made. This problem has been studied for
SPJ-views in the relational model (see [15] for a survey) and for object-oriented
databases [16, 13]. SPJ-views have an important property: They are distributive
with respect to relational union. For example, R_1 ⋈ (R_2 ∪ ΔR_2) = (R_1 ⋈
R_2) ∪ (R_1 ⋈ ΔR_2) holds for any relations R_1 and R_2 and some new tuples
ΔR_2. To compute the view R_1 ⋈ (R_2 ∪ ΔR_2), the result of R_1 ⋈ ΔR_2 must be
added to the existing view R_1 ⋈ R_2. This is usually much more efficient than
reevaluating the view.
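
The distributive law can be tried out on a toy instance. The sketch below assumes two relations R1(a, b) and R2(b, c) stored as Python sets of tuples and joined on the shared attribute b; the relation contents are made up for illustration.

```python
def join(r1, r2):
    """Join R1(a, b) with R2(b, c) on the shared attribute b."""
    return {(a, b, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}


r1 = {(1, "x"), (2, "y")}
r2 = {("x", 10)}
delta_r2 = {("y", 20), ("x", 30)}  # new tuples inserted into R2

view = join(r1, r2)                       # materialized view before the update
incremental = view | join(r1, delta_r2)   # only the delta is joined
recomputed = join(r1, r2 | delta_r2)      # full reevaluation, for comparison

assert incremental == recomputed
```

The incremental path touches only the new tuples of R2, which is why it is usually cheaper than the full reevaluation when the delta is small.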

More generally, SPJ views are functions f(R_1, ..., R_n) of (not necessarily
distinct) base relations R_1, ..., R_n that are multi-linear with respect to union: ^
f(R_1, ..., R_i ∪ ΔR_i, ..., R_n) = f(R_1, ..., R_i, ..., R_n) ∪ f(R_1, ..., ΔR_i, ..., R_n).
Multi-linearity is a powerful concept. It allows efficient view maintenance for
bulk updates and is also fundamental for other optimization techniques, such as
parallel query evaluation and pipelining.

* This research was supported in part by DOE DE-FG02-94-ER-61923 Sub 1,
NSF DBI99-75206, NSF IIS98-17444, ARO DAAG55-98-1-0331, and a grant from
SmithKline Beecham.

^ We slightly misuse the notion of multi-linearity: To be multi-linear, ∪ must be a
group operation with an inverse function similar to the definition in [13].

    <Conf name="STACS" year="1996">
      <Publ> <Title> Views </Title>
             <Author> Tim </Author>
             <Author> Peter </Author>
             <Pages> <From> 117 </From>
                     <To> 127 </To>
             </Pages>
      </Publ>
      <Publ> <Title> Types </Title>
             <Author> Tim </Author>
             <Pages> <From> 134 </From>
                     <To> 146 </To>
             </Pages>
      </Publ>
    </Conf>
    <Person>
      <Name> Tim </Name>
      <Age> 35 </Age>
    </Person>
    <Person>
      <Name> Peter </Name>
      <Age> 45 </Age>
    </Person>

Fig. 1. A Publication Database in XML

Motivating Example. Fig. 1 shows an XML document with conference
publications. Each publication has a title, a list of author names and belongs to a
conference. Each author name refers to a person.

In the semistructured model, nodes are identified by object identities and
the update types are edge insertion, edge deletion, and value modification [2, 19].
Updates based on object identities, however, make it difficult to reason about
what parts of the view have been affected.

To illustrate, consider the following simple view: store all authors who
published in STACS'96. Furthermore, consider the insertion of a new author into
some publication with OID p. Since it is not directly observable whether p
belongs to STACS'96 or not, it is not clear whether the author should be inserted
into the view. Existing algorithms [2, 19] therefore require auxiliary data
structures to represent the relationship between view objects and those in the source.
These complex data structures can be large and expensive to maintain. Another
drawback of these approaches is that each atomic update causes queries to the
source to update the view, which is inefficient for large update sequences.

The WHAX Approach. The Warehouse Architecture for XML (WHAX) is a new
architecture that combines the power and simplicity of multi-linearity with the
flexibility of the semistructured data model. The WHAX model is similar to the
null-terminated deterministic model [5] and is also closely related to LDAP [18].
WHAX is based on the concept of local keys: Each outgoing edge of the same
parent node is identified by some locally unique key. Hence, each node in the
tree is uniquely identified by the path from the root.

Contributions. The query language WHAX-QL, based on XML-QL [12],
generalizes SPJ-queries and allows powerful restructuring through regrouping and
aggregations. We present a restriction of WHAX-QL that is provably multi-linear,
and hence allows efficient incremental maintenance. For deletion updates, we
develop an extension of the counting technique used for SPJ-views. This technique
can also be used for view definitions involving aggregation.

Limitations. In contrast to XML, the WHAX data model is unordered.
Initial ideas for view maintenance of ordered data are briefly described in Sec. 7.
Furthermore, the WHAX model is tree-based and does not support navigation
through graph structures as in the semistructured data model.

The rest of the paper is organized as follows. Sec. 2 describes the WHAX data 
model, followed by the view definition language in Sec. 3. We describe the multi- 
linearity property for WHAX-views in Sec. 4 and extend the incremental view 
maintenance technique to deletions in Sec. 5. Sec. 6 describes how aggregate 
queries can be efficiently maintained in WHAX. We conclude in Sec. 7. 

2 The WHAX Data Model 

A WHAX data value is an unordered edge-labeled tree. The tree is deterministic
in that the outgoing edges of a node have different local identifiers. Let B be the
set of all base values (strings, integers, ...) and V be the set of WHAX-values.
Local identifiers are pairs l(k) with label l ∈ B and key k ∈ V. The set of all
WHAX-values is defined recursively as the (minimal) set of finite partial functions
from local identifiers to trees: V = (B × V) ⇀ V.

Values are constructed using {l_1(k_1) : {...}, ..., l_n(k_n) : {...}}. If k_i in l_i(k_i)
is the empty partial function {}, then we write l_i. If k_i is a tree construct {...},
then we write l_i(...) instead of l_i({...}). For example, the value {Conf(@name :
{STACS : {}}, @year : {1996 : {}}) : ...} represents conference STACS'96.
Singleton trees of the form {str : {}} occur frequently and are abbreviated as
double-quoted literals "str". Hence, the value above is equivalent to {Conf(@name :
"STACS", @year : "1996") : ...}.

Fig. 2 shows the WHAX-value for the data in Fig. 1. We adopt the XPath [10]
convention and precede attribute labels with @. The Age label forms an
identifier with empty key {}. Recall that "str" represents the singleton function
{str : {}}.

Deep Union. The deep union operation [5] of two WHAX-trees v_1 and v_2 matches
the common identifiers of v_1 and v_2 and recursively performs deep union on their
respective subvalues. The edges that only occur in v_1 (or v_2), but not in v_2 (v_1,
respectively) are copied into the resulting tree: ^

\[
\begin{aligned}
v_1 \uplus v_2 ::={}& \{\, l(k) : s_1 \uplus s_2 \mid l(k) : s_1 \in v_1,\ l(k) : s_2 \in v_2 \,\} \;\cup \\
& \{\, l(k) : s \mid l(k) : s \in v_1,\ l(k) \notin \mathrm{dom}(v_2) \,\} \;\cup \\
& \{\, l(k) : s \mid l(k) : s \in v_2,\ l(k) \notin \mathrm{dom}(v_1) \,\}
\end{aligned}
\]

Fig. 3. Deep Union

Fig. 3 shows how the shoesize and address value for Tim can be added to a
database with Tim and Peter.
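
Deep union is easy to prototype. The following sketch assumes a WHAX-tree is stored as a nested Python dict that maps local identifiers to subtrees; for brevity, identifiers are flattened to strings such as "Person(Tim)", and the shoesize and address values are invented, since the contents of Fig. 3 are not reproduced here.

```python
def deep_union(v1, v2):
    """Deep union of two trees given as nested dicts {identifier: subtree}."""
    result = {}
    for ident in v1.keys() | v2.keys():
        if ident in v1 and ident in v2:
            # Common identifier: merge the subvalues recursively.
            result[ident] = deep_union(v1[ident], v2[ident])
        elif ident in v1:
            result[ident] = v1[ident]  # occurs only in v1: copied as is
        else:
            result[ident] = v2[ident]  # occurs only in v2: copied as is
    return result


db = {
    "Person(Tim)": {"Age": {"35": {}}},
    "Person(Peter)": {"Age": {"45": {}}},
}
delta = {
    "Person(Tim)": {
        "Shoesize": {"43": {}},        # hypothetical value
        "Address": {"Main St": {}},    # hypothetical value
    },
}
merged = deep_union(db, delta)
```

Only the path to Person(Tim) is matched against the existing tree; Peter's subtree is carried over untouched, which is the property that makes bulk insertions cheap to apply.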

Relational Databases in WHAX. There is a natural translation from relational
databases into WHAX: Each tuple is represented by an outgoing edge from the
root. The relation name R and the key k of the tuple form the local identifier
R(k). All non-key attributes form a subtree under R(k).

3 Defining Views in WHAX 

Over the past few years, several query languages for semi-structured and XML 
data have been proposed [12,9,3,1]. Our language, called WHAX-QL, differs 
from XML-QL in that local identifiers (i.e. labels and keys) can be matched 
against patterns. We start by illustrating WHAX-QL through some examples. 

View V_1: Select the name and the age of all authors older than 36:

    V1($db) = where <Person($n).Age> $a </> in $db,
                    $a > 36
              construct <MyPerson($n).Age> $a </>

The path expression Person($n).Age identifies the person's age in the database
$db and binds the person's name to $n. The subtree under Age is bound to
variable $a. Recall that the age is represented as a single outgoing edge from the
Age node. To satisfy $a > 36, the value of $a must be of the form {l : {}} where
l is an integer with l > 36.

View V_2: For each author, return the title, conference, and page numbers of
publications:

    V2($db) =
      where <Conf(@name : $n, @year : $y).Publ(Title : $t)>
              <Author($a)> </>
              <Pages> $p </>
            </> in $db
      construct <Author(Name : $a).Publ(Title : $t, Conf : $n, Year : $y).Pages> $p </>

Variable $a binds to any author name and $p is bound to the pages of the
publication identified by $n, $y, and $t. The view is a tree with authors at the
root and their publications at the leaves.

It is always possible to "unnest" nested variable bindings. For example, View
V_2 could be equivalently written as:

    V'2($db) =
      where <Conf(@name : $n, @year : $y).Publ(Title : $t).Author($a)> </> in $db,
            <Conf(@name : $n, @year : $y).Publ(Title : $t).Pages> $p </> in $db
      construct <Author(Name : $a).Publ(Title : $t, Conf : $n, Year : $y).Pages> $p </>

As in XML-QL, multiple occurrences of the same variable represent the same
value. In the example above, this ensures that author $a and pages $p belong to
the same publication.

^ dom(v) is the set of local identifiers in v: dom(v) := {l(k) | l(k) : s ∈ v}.

[Figure labels: Conf(@name:"STACS", @year:"1996"); Publ(Title:"Views"); Author("Tim")]

Fig. 4. Regrouping result for view V_3

    e     ::= "str" | $x | e_1 op e_2 | e_1 ⊎ e_2 | Q
            | (LPat_1(e_1) : e'_1, ..., LPat_n(e_n) : e'_n)
    Q     ::= where <PPat_1> $x_1 </> in $d_1, ...,
                    <PPat_m> $x_m </> in $d_m,
                    cond_1, ..., cond_n
              construct <PExpr_1> e_1 </>, ..., <PExpr_p> e_p </>
    PPat  ::= LPat_1(KPat_1). ... .LPat_n(KPat_n)
    LPat  ::= l | $x
    KPat  ::= $x | (l_1(v_1) : KPat_1, ..., l_n(v_n) : KPat_n)
    PExpr ::= LPat_1(e_1). ... .LPat_n(e_n)

Fig. 5. Syntax of WHAX-QL

Example View V_3: For each person, return their age and all STACS publications.
Furthermore, group all publications by their year:

    V3($db) =
      where <Person(Name : $n).Age> $a </> in $db,
            <Conf(@name : "STACS", @year : $y).Publ(Title : $t).Author($n)> </> in $db
      construct <Author(Name : $n).Age> $a </>,
                <Author(Name : $n).STACS(Year : $y).Title($t)> </>

This query performs a join between persons and authors over variable $n. Fig.
4 shows the regrouping effect of this query.

The Syntax of WHAX-QL (Fig. 5). The where-clause of Q describes the variable
bindings. For each valuation of the variables in the where-clause, the
construct-clause is evaluated and the results are deep-unioned together.

The identifiers $d_1, ..., $d_m in the where-clause denote WHAX data sources
and are called base variables. A path pattern PPat_i matches a path in $d_i against
a given sequence of label patterns LPat and key patterns KPat, separated by dots:
LPat_1(KPat_1). ... .LPat_n(KPat_n). The value at the end of the path is bound to
variable $x_i. As in XML-QL, variable $x_i can be omitted if it is not used.

A label pattern LPat is either a constant label l or a label variable $x,
and a key pattern KPat is either a variable $x or a complex pattern (l_1(v_1) :
KPat_1, ..., l_n(v_n) : KPat_n). It matches a WHAX-tree if the tree has exactly n
elements with distinct local identifiers l_1(v_1), ..., l_n(v_n) and the element values
match patterns KPat_1, ..., KPat_n, respectively. If KPat (or v) in LPat(KPat)
(l(v), respectively) is the empty value {}, then we write LPat (l, respectively).

Variables in WHAX-QL. We distinguish several types of variables by their first
occurrence within a WHAX-expression. Parameters to the query (such as $db)
are called parameter variables. Variables in path patterns PPat_i are bound to
labels or key values and are called label variables and key variables, respectively.
Variables $x_i at the leaves of path bindings are called value variables.

Syntactic Simplifications. Views V_2 and V'_2 in Sec. 3 illustrate how nested
bindings can be eliminated. Similarly, we can eliminate the case where some value
variable $x_i from <PPat_i> $x_i </> in $db_i is the base variable $db_j of some
other pattern in the same where-clause: <PPat_j> $x_j </> in $x_i. The second
pattern can be replaced by pattern <PPat_i.PPat_j> $x_j </> in $db_i.

4 View Maintenance through Multi-Linearity 

The multi-linearity law is the foundation for efficient incremental view
maintenance of bulk updates. For example, consider a multi-linear view V(R_1, R_2):
V(R_1 ∪ ΔR_1, R_2) = V(R_1, R_2) ∪ V(ΔR_1, R_2) and V(R_1, R_2 ∪ ΔR_2) = V(R_1, R_2) ∪
V(R_1, ΔR_2). The updated view can be easily computed by inserting the result
of query V(ΔR_1, R_2) (or V(R_1, ΔR_2)) into the view V(R_1, R_2).

Unfortunately, WHAX-QL queries are not necessarily multi-linear with
respect to deep union in their given form. We identify two properties to make
WHAX-QL queries multi-linear: First, the key variable constraint forbids the use
of parameter and value variables as key and operand variables. Second, the base
variable constraint forbids the multiple use of base variables.

Key Variable Constraint. Recall example view V_1 from Sec. 3, which accesses
value variable $a in condition $a > 36. Consider the (rather unusual) insertion
of a second age for some author. Then the condition will become false because
$a is no longer a singleton tree. Hence, the view is not multi-linear.

Similarly, consider the mixed use of parameter/value variable $x as a key
variable in a path pattern PPat (or path expression PExpr). The value of the
parameter/value variable can change during an update and this will make PPat
(PExpr) refer to a completely different value. Again, those views are not
multi-linear. Therefore, we syntactically restrict WHAX-QL queries as follows:

Definition 1. A WHAX-QL query is maintainable if no parameter/value
variable $x occurs as a key variable or operand of some base operation e_1 op e_2.

One can sometimes replace the query by a similar query that returns the same
expected result. For example, view V_1 can be transformed by binding $a to the
label (i.e. the age value) of each Age edge:

    V''1($db) = where <Person($n).Age.$a> </> in $db,
                      $a > 36
                construct <MyPerson($n).Age.$a> </>

This multi-linear view has the same intended result.

Base Variable Constraint. View V_3($db) in Sec. 3 uses variable $db twice as a
base variable. One can observe that the view is not multi-linear for a database
DB and update ΔDB (i.e. V_3(DB ⊎ ΔDB) ≠ V_3(DB) ⊎ V_3(ΔDB)), since some
persons or authors in ΔDB might join with authors (persons, resp.) in DB.

Fortunately, the view can be made multi-linear by replacing the second
occurrence of $db with a fresh variable $db' and adding $db' as a parameter variable:

    V'3($db, $db') =
      where <Person(Name : $n).Age> $a </> in $db,
            <Conf(@name : "STACS", @year : $y).Publ(Title : $t).Author($n)> </> in $db'
      construct <Author(Name : $n).Age> $a </>,
                <Author(Name : $n).STACS(Year : $y).Title($t)> </>

The new multi-linear view V'_3($db, $db') is equivalent to V_3($db), if applied to
the same database: V'_3(DB, DB) = V_3(DB).

Theorem 1. A maintainable WHAX-QL view V($d_1, ..., $d_n) is multi-linear in
its parameters $d_1, ..., $d_n if all base variables $d_i of the same
where-construct-expression are distinct and do not occur in the construct-clause.

Lemma 1. A maintainable view V($db_1, ..., $db_n) can always be transformed
into an equivalent multi-linear view.

Details of the rewriting process can be found in [17].

Relational SPJ Views and WHAX-QL Views. Based on the mapping in Sec. 2, it
can also be proven that WHAX-QL is a generalization of relational SPJ-queries:

Lemma 2. Relational SPJ views can be expressed as maintainable WHAX queries.

5 Deletions 

Multi-linearity cannot be used for propagating deletion updates. The reason
is analogous to SPJ-views: a tuple in the view may have multiple derivations from
tuples in the base relations. Hence, it is unclear when the tuple should be deleted.

Two approaches to deletions have been studied for SPJ-views: 1) view
analysis [6] and 2) view maintenance using multi-set semantics [14] or counting
[4, 15]. The view analysis algorithm in [6] is a static decision procedure which
accepts only views for which any tuple in the view is guaranteed to have exactly
one derivation from the base relations. Alternatively, one can keep track of the
number of derivations for each distinct tuple in the view [14, 4, 15].

These techniques are based on invertible versions of the set union
operation: disjoint, bag, or counting union. Intuitively, invertible union operations
allow the efficient propagation of deletions since the following multi-linearity law
for deletion can be derived: V(D_1, ..., D_i − ∇D_i, ..., D_n) = V(D_1, ..., D_i, ..., D_n) −
V(D_1, ..., ∇D_i, ..., D_n).

Unfortunately, the deep union operation in WHAX is not invertible.
Furthermore, the natural disjoint union operation for WHAX is complicated to handle
(preliminary ideas can be found in [17]). Instead, we focus on the counting
technique in the next section.

[Figure labels: Publ(Title:"Types")[4]; Author("Tim")[1]; Pages[2]; From[1]; 134[1]; Author("Ina")[1]]

Fig. 6. Annotating a WHAX-tree with supports



5.1 Counting in WHAX 

Similar to counting for SPJ-views [4, 15], each edge with local identifier l(k)
in the WHAX-tree is annotated with a count c that describes the number of
derivations for the edge: l(k)[c]. We call count c the support of the edge.

The tree on the left of Fig. 6 shows parts of the tree from Fig. 2 with supports.
Leaf edges have support 1, and the support of the inner edges is the sum of the
children's supports (the indirect support of the edge) and some (by default, zero)
direct support for the edge itself. The indirect support of leaf edges is 0.

Insertions and deletions are represented as trees with positive or negative
supports. Fig. 6 shows how a new author "Ina" is added to publication "Types"
and how a page number is changed from 146 to 147. Note that the change is
modeled as a deletion (support -1) plus an insertion (support 1). The deep union
operator ⊎_c for trees with counts is defined as follows:

\[
\begin{aligned}
v_1 \uplus_c v_2 ::={}& \{\, l(k)[c_1 + c_2] : (s_1 \uplus_c s_2) \mid l(k)[c_1] : s_1 \in v_1,\ l(k)[c_2] : s_2 \in v_2, \\
& \qquad (c_1 + c_2 \neq 0 \ \lor\ s_1 \uplus_c s_2 \neq \{\}) \,\} \;\cup \\
& \{\, l(k)[c] : s \mid l(k)[c] : s \in v_1,\ l(k)[\ldots] \notin \mathrm{dom}(v_2) \,\} \;\cup \\
& \{\, l(k)[c] : s \mid l(k)[c] : s \in v_2,\ l(k)[\ldots] \notin \mathrm{dom}(v_1) \,\}
\end{aligned}
\]

Note that merged edges are eliminated if their support is empty (c_1 + c_2 = 0)
and if they do not have children (s_1 ⊎_c s_2 = {}).
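
The counting variant can be prototyped in the same style. In this sketch, a tree is assumed to be a dict that maps each local identifier to a pair (support, subtree); identifiers are plain strings, and the supports on the inner edges of the update tree (0 for Pages and To, i.e. the net change of one deletion plus one insertion) are an assumption about how such an update would be annotated.

```python
def counting_deep_union(v1, v2):
    """Deep union over trees of the form {identifier: (support, subtree)}."""
    result = {}
    for ident in v1.keys() | v2.keys():
        if ident in v1 and ident in v2:
            c1, s1 = v1[ident]
            c2, s2 = v2[ident]
            merged = counting_deep_union(s1, s2)
            # A merged edge survives unless its support is 0 and it is childless.
            if c1 + c2 != 0 or merged:
                result[ident] = (c1 + c2, merged)
        else:
            result[ident] = v1[ident] if ident in v1 else v2[ident]
    return result


# Part of publication "Types": pages 134 to 146.
pages = {
    "Pages": (2, {
        "From": (1, {"134": (1, {})}),
        "To": (1, {"146": (1, {})}),
    }),
}
# Change the last page from 146 to 147: one deletion plus one insertion.
update = {
    "Pages": (0, {
        "To": (0, {"146": (-1, {}), "147": (1, {})}),
    }),
}
updated = counting_deep_union(pages, update)
```

The edge 146 reaches support 0 without children and is dropped, while 147 is copied in from the update tree, so the result holds the pages 134 to 147.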

The supports in the view must be carefully chosen to preserve the semantics 
of queries, as the following nested view illustrates: 

where <Person(Name : $n) > $p </> in $db 

construct <MyPers (Name : $n) > 

where <Age> $a </> in $p construct <Age> $a </> s </> 

The indireet support of path MyPers(Name:$n) in the view may be 0 if there is 
no Age edge in the base data. Its direet support must therefore be > 0, and is 
determined by the (direct+indirect) support of Person (Name: $n) in the source. 

Consider the syntax of where-construct-queries (Fig. 5). Let $\phi$ denote a
valuation for the variables in the where-clause and let $PPat_i(\phi)$
($PExpr_j(\phi)$) be the instantiation of pattern $PPat_i$ (path expression
$PExpr_j$ respectively) under this valuation. Let the support of path $p$,
$Supp(p)$, be the support (direct + indirect) of the last edge in the path.
The direct support of a path, $DSupp(p)$, is defined similarly. Then, the direct
support of each output path $PExpr_j(\phi)$ is determined by the product of the
supports of each of the input paths $PPat_i$:

$$\forall\, 1 \le j \le n : \quad DSupp(PExpr_j(\phi)) = \prod_{1 \le i \le m} Supp(PPat_i(\phi)).$$
Lemma 3. Each multi-linear WHAX-QL view $V(\$d_1, \ldots, \$d_n)$ is also
multi-linear under counting semantics using counting deep union $\uplus_c$.

6 Aggregations in WHAX 

A simple extension of WHAX allows the use of aggregations. First, the syntax
is extended such that expressions e_i in the construct-clause of
where-construct-queries (Fig. 5) can be any aggregate function sum(e), avg(e),
count(), min(e), or max(e). The following view determines the number of pages
published for each conference and year:

    where <Conf(@name:$cn,@year:$cy).Publ(Title:$t).Pages>
              <From> $from </> <To> $to </> </> in $db
    construct <Conf(@name:$cn,@year:$cy).SumPages> sum($to-$from+1) </>

Aggregate values produced by the construct-clause are merged through aggregate
deep union $\uplus_a$ with the following changes in the model: Leaf edges with
annotation l[c]† can be annotated with aggregate tags: aggr :: l[c]. For example,
Count::2[2] represents a publication count of 2 (with support 2) and min::24[1] a
minimum author age of 24.

The deep union $\uplus_a$ over WHAX-trees $v_1$ and $v_2$ with aggregates has the
following meaning: For any two matching aggregate elements
$aggr :: l_1[c_1] \in v_1$ and $aggr :: l_2[c_2] \in v_2$, a new edge
$aggr :: l[c_1 + c_2]$ with $l = aggr(l_1, l_2, c_1, c_2)$ is created. The
aggregate combinator function $aggr(l_1, l_2, c_1, c_2)$ is defined depending on
aggr: $sum(l_1, l_2, c_1, c_2) = l_1 + l_2$,
$min(l_1, l_2, c_1, c_2) = \min(l_1, l_2)$, etc. Note that min and max are
undefined for $c_1 \cdot c_2 < 0$, since min and max cannot be incrementally
maintained under deletions.
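
As a rough illustration of the combinator, the following Python sketch covers
only sum, min, and max. The function name combine is hypothetical, and the
merged support c1 + c2 is an assumption carried over from the counting deep
union.

```python
def combine(aggr, l1, l2, c1, c2):
    """Merge two matching aggregate edges aggr::l1[c1] and aggr::l2[c2].

    Returns the pair (new value, new support). The support c1 + c2 is an
    assumption taken from the counting deep union.
    """
    if aggr == "sum":
        return l1 + l2, c1 + c2
    if aggr in ("min", "max"):
        if c1 * c2 < 0:
            # One side is a deletion: min/max cannot be maintained
            # incrementally, so the combinator is undefined here.
            raise ValueError(f"{aggr} is undefined when c1 * c2 < 0")
        pick = min if aggr == "min" else max
        return pick(l1, l2), c1 + c2
    raise NotImplementedError(f"aggregate {aggr!r} is not covered by this sketch")
```

A caller that hits the undefined case would have to recompute the aggregate
from the underlying data instead of merging.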

7 Conclusion and Future Work 

The WHAX-model is a hierarchical data model similar to the deterministic mod- 
el [5] and LDAP [18]. In contrast to previous work on view maintenance in 
semistructured databases [2,19], the WHAX architecture supports complex re- 
structuring operations such as regrouping, flattening, and aggregations. Fur- 
thermore, while existing work only considers atomic updates, our technique is 
applicable to large batch updates, which can reduce the refresh time of ware- 
houses considerably. The underlying multi-linearity principle is also important 
for other optimization problems, such as query parallelization and pipelining.

Our approach can be extended with other view maintenance techniques, e.g. 
deferred view maintenance [11] or detection of irrelevant updates [4]. 

The WHAX model does not currently support ordered data structures. Ordered
structures are difficult to maintain if updated elements are identified by
their position. Positions are “dynamic” in two ways: 1) positions of elements
might change during updates, and 2) positions may be different in views. There- 
fore, a mapping between “dynamic” positions and “static” keys must be pro- 
vided. This mapping changes during updates and efficient index schemes (e.g. 
based on B-trees) are required. 

Acknowledgements: We would like to thank Peter Buneman and Wang-Chiew 
Tan for helpful discussions.

† The key k and the value v in l(k) : v are always empty for aggregate edges.
Abstract

Data warehouses usually store large amounts of information, representing
an integration of base data from different data sources over a long
time period. An aggregate view can be stored as a set of its horizontal
fragments in order to reduce warehouse query response time and
maintenance cost.

This paper proposes a scheme that efficiently maintains horizontally 
partitioned data warehouse views. Using the proposed scheme, only one 
view fragment holding the relevant subset of tuples of the view is accessed 
for each update. The scheme also includes an approach to reduce the 
refresh time for maintaining views that compute the aggregate functions MIN
and MAX. 

Keywords: Data Warehouse Applications, View Maintenance, 
Horizontal Partitioning, Performance Improvement. 



1 Introduction 

Different data sources are usually integrated into one large central fact table in 
the data warehouse holding the main integrated data. Foreign key attributes 
in the fact table may refer to dimension tables, which describe the dimension 
attributes in the fact table. Data warehouse views are derived from the fact 
table and may possibly join one or more of the dimension tables. Cells of an 
n-dimensional data cube hold data that can easily construct 2^n subviews for an
aggregate measure of interest. Breaking up a relation into its horizontal
fragments would lead to a reduction in both query response time and maintenance
cost. Horizontal fragmentation partitions a relation such that each fragment
contains only a subset of tuples of the relation that satisfy specific predicate
conditions [2].

* This research was supported by the Natural Science and Engineering Research Council
(NSERC) of Canada under an operating grant (OGP-0194134) and a University of Windsor

Data in a view may be deleted or updated, or new data can be inserted
into a view, with each update expressed as a delete-insert pair. These constitute
refresh operations. Views can be maintained by computing the changes made 
to data sources and applying these changes to views to bring them up-to-date 
without accessing base data. This maintenance method is called incremental 
view maintenance. 

1.1 Related Work 

Harinarayan et al. [3] use a lattice framework to model the dependencies
between subviews of the data cube, and also present a greedy algorithm for
selecting some of the views to store. Ezeife [2] presents a scheme that stores a
selected view as a set of its horizontal fragments to reduce query response time
and maintenance cost.

Work on incremental view maintenance techniques includes [1, 4]. Colby
et al. [1] propose to split the deferred maintenance work into propagate and
refresh functions in order to minimize the batch time needed for maintenance.
Mumick et al. [4] show how these two functions can be derived for
aggregate views.

1.2 Contributions 

This paper contributes by proposing a new view maintenance scheme for hor- 
izontally partitioned data warehouse cube views which leads to reduced view 
maintenance cost. The scheme handles all three types of refresh operations. In 
addition, this paper proposes a method to reduce the refresh time for views with 
aggregate functions MIN and MAX by eliminating the need to visit the base 
relations in order to compute a new minimum or maximum value when the current
one is deleted from the base table. The proposed new scheme and the schemes
presented in [4] are implemented and compared. 

1.3 Outline of the Paper 

Section 2 gives an example to show how to maintain a horizontally partitioned 
view using the proposed approach. Section 3 formally presents the new scheme 
for maintaining horizontally partitioned data cube views. Section 4 discusses 
two experimental study cases to demonstrate the performance and benefits of 
the new scheme, while section 5 presents conclusions and future work. 



2 An Example 

The following is an example of a data warehouse for the service departments 
of a car dealer chain company. This company has many branches at different
locations. The data warehouse has the following fact and dimension tables:

    DFT (DealerID, ProviderID, ServiceTime, Gain)
    Dealer (DealerID, DealerName)
    Provider (ProviderID, ProviderName)
    Time (ServiceTime, Month, Year)

Table 1: Car Dealer Fact Table DFT

    DealerID  ProviderID  ServiceTime    Gain
    003       02          199904020030    200
    001       02          199904020048    300
    002       01          199904020049    500
    002       03          199904020050    400
    001       01          199904020068    200
    001       01          199904020064     50
    003       02          199904020067   -200
    002       03          199904020070    600
    002       01          199904030070    200
    001       02          199904030080    200
    001       02          199904030090    100

The attributes of the fact table DFT consist of a dealer’s branch ID,
service provider’s ID, service time in minutes, and the gain or profit of a single
service. Each tuple in the fact table contains the gain of a single service provided 
by a service provider in a dealer branch at a certain time. If the gain is negative, 
it means that a customer is not satisfied with a provided service and has asked 
for a refund. The ServiceTime contains twelve digits, e.g., 199904020030. The 
first four digits indicate the service year, and the second four digits indicate 
the service month and day. The last four digits in ServiceTime represent the 
service minute of the day starting from 0 and ending with 1439. The table DFT 
(Table 1) contains sales information for the chain company from minute 0030 
April 2 to minute 0090 April 3, 1999. 
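
The twelve-digit ServiceTime layout described above can be decoded with a
short Python sketch. The function name parse_service_time is hypothetical and
is introduced only for illustration.

```python
from datetime import datetime, timedelta


def parse_service_time(service_time: str) -> datetime:
    """Decode a 12-digit ServiceTime: YYYY, MMDD, then minute of the day."""
    if len(service_time) != 12 or not service_time.isdigit():
        raise ValueError("ServiceTime must consist of exactly twelve digits")
    day = datetime.strptime(service_time[:8], "%Y%m%d")
    minute_of_day = int(service_time[8:])
    if not 0 <= minute_of_day <= 1439:
        raise ValueError("minute of the day must lie between 0 and 1439")
    return day + timedelta(minutes=minute_of_day)


first = parse_service_time("199904020030")  # minute 0030 on April 2, 1999
last = parse_service_time("199904030090")   # minute 0090 on April 3, 1999
```

The first six digits (year and month) are what the monthly view DPM groups on.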

Assume that only the top-level aggregate view (DPM) is stored in the
data warehouse. D stands for DealerID, P stands for ProviderID, and M stands
for service Month. DPM records the monthly total service amount, minimum
gain, and the total number of non-refunded services for each individual service
provider in each dealer branch. In order to handle deletions in some situations,
an SQL aggregate function Count(*) is added to the view. DPM has the following
schema: DPM (DealerID, ProviderID, ServiceMonth, Total, MinGain,
TotalCount). The top view DPM is shown in Table 2.
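
The derivation of DPM from DFT can be sketched in Python as follows. The
handling of refunds is an assumption: a negative gain is treated as cancelling
one earlier service (count -1) and is ignored for MinGain, which is one
reading that reproduces the DPM rows for the sample data. The function name
build_dpm is hypothetical.

```python
# Rows of the fact table DFT: (DealerID, ProviderID, ServiceTime, Gain).
DFT = [
    ("003", "02", "199904020030", 200),
    ("001", "02", "199904020048", 300),
    ("002", "01", "199904020049", 500),
    ("002", "03", "199904020050", 400),
    ("001", "01", "199904020068", 200),
    ("001", "01", "199904020064", 50),
    ("003", "02", "199904020067", -200),
    ("002", "03", "199904020070", 600),
    ("002", "01", "199904030070", 200),
    ("001", "02", "199904030080", 200),
    ("001", "02", "199904030090", 100),
]


def build_dpm(fact_rows):
    """Group by (DealerID, ProviderID, month) -> (Total, MinGain, Count)."""
    groups = {}
    for dealer, provider, service_time, gain in fact_rows:
        key = (dealer, provider, service_time[:6])  # YYYYMM is the month
        total, min_gain, count = groups.get(key, (0, None, 0))
        if gain >= 0:
            min_gain = gain if min_gain is None else min(min_gain, gain)
            count += 1
        else:
            # Assumption: a refund cancels one earlier service.
            count -= 1
        groups[key] = (total + gain, min_gain, count)
    # Groups whose services were all refunded do not appear in the view.
    return {key: row for key, row in groups.items() if row[2] > 0}


dpm = build_dpm(DFT)
```

Under this assumption the group (003, 02) drops out, because its single
service was refunded.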

Table 2: DPM View (V1)

    DealerID  ProviderID  ServiceMonth  Total  MinGain  Count(*)
    001       01          199904          250       50         2
    001       02          199904          600      100         3
    002       01          199904          700      200         2
    002       03          199904         1000      400         2

Table 3: Net-Change-DPM Table

    DealerID  ProviderID  ServiceMonth  Total  MinGain  Count(*)
    001       01          199904          100      200         1
    001       03          199904          400      400         1
    001       02          199904         -100      100        -1
    002       01          199904         -700      200        -2

In the example, assume that DealerID ≤ 001 and ProviderID > 02
are the simple predicates SP_1 and SP_2, taken from queries accessing the
warehouse view DPM, on which the fragmentation process was based. Minterm
predicates are used to partition the view. The four minterm predicates created
from these two simple predicates, by taking their conjunctions in either their
natural or negated form [2], are:

    M_1 = SP_1 ∧ SP_2     ⇒  DealerID ≤ 001 ∧ ProviderID > 02
    M_2 = SP_1 ∧ ¬SP_2    ⇒  DealerID ≤ 001 ∧ ProviderID ≤ 02
    M_3 = ¬SP_1 ∧ SP_2    ⇒  DealerID > 001 ∧ ProviderID > 02
    M_4 = ¬SP_1 ∧ ¬SP_2   ⇒  DealerID > 001 ∧ ProviderID ≤ 02

These four created minterm predicates define the four fragments of DPM. The
changes (services and refunds) made to the fact table DFT are recorded in two
tables Insertion-DFT (for all inserts) and Deletion-DFT (for all deletes). Services
rendered by (DealerID, ProviderID) are (001, 01) with a gain of $300 and (001,
03) with a gain of $400. Refunds done are as follows: (001, 02) received a service
gain of $-100 through refund, (001, 01) got $-200, (002, 01) got $-200 while (002,
01) had $-500.

To compute the changes to be made to DPM (top view), three virtual views
Insertion-DPM, Deletion-DPM and Insert-Delete-DPM are created. Insertion-DPM
and Deletion-DPM are derived from Insertion-DFT and Deletion-DFT respectively,
in the same way as the top view DPM is derived from the fact table DFT.
The union of Insertion-DPM and Deletion-DPM forms the Insert-Delete-DPM
virtual view. The net-change table, Net-Change-DPM (Table 3), is created
by aggregating Insert-Delete-DPM. Each tuple in a net-change table causes a
single refresh in the corresponding view DPM. The view being refreshed
is called the Current View, while the tuple of the net-change table
currently being used to refresh the Current View is called the Active Tuple. For
each Active Tuple in a net-change table, the algorithm finds the Needed
Fragment and the Matched Tuple and updates the Matched Tuple according to the
Active Tuple. If the Matched Tuple is not found, the Active Tuple in
the net-change table is new to the Needed Fragment and is inserted into the
Needed Fragment.
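
Finding the Needed Fragment for an Active Tuple amounts to evaluating the two
simple predicates. The Python sketch below assumes that SP_1 is DealerID ≤ 001
and SP_2 is ProviderID > 02, the reading that is consistent with the fragment
contents of Table 5; the function name needed_fragment and the labels F1 to F4
are hypothetical.

```python
def needed_fragment(dealer_id: str, provider_id: str) -> str:
    """Return the fragment whose minterm predicate the tuple satisfies."""
    sp1 = int(dealer_id) <= 1    # SP_1: DealerID <= 001 (assumed reading)
    sp2 = int(provider_id) > 2   # SP_2: ProviderID > 02
    if sp1 and sp2:
        return "F1"              # M_1 = SP_1 and SP_2
    if sp1:
        return "F2"              # M_2 = SP_1 and not SP_2
    if sp2:
        return "F3"              # M_3 = not SP_1 and SP_2
    return "F4"                  # M_4 = not SP_1 and not SP_2


placement = {
    pair: needed_fragment(*pair)
    for pair in [("001", "03"), ("001", "01"), ("001", "02"),
                 ("002", "03"), ("002", "01")]
}
```

Because the minterm predicates are mutually exclusive and exhaustive, every
tuple maps to exactly one fragment, which is why only one fragment has to be
accessed per Active Tuple.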

In order to reduce the refresh time for recomputing the new value from the
fact table for the aggregate function MIN in DPM when the current minimum
value is deleted, we create a view Min-DPM (Table 4) derived from the fact
table DFT for view DPM. It contains the next smallest gain from the updated
fact table for each tuple (each distinct set of group-by attribute values) in DPM.
The updated fragments of DPM are presented in Table 5.

Table 4: Min-DPM

    Rowid  001-02-9904  002-01-9904  002-03-9904  001-01-9904
    1              200          500          600          200

Table 5: The Updated Fragments of DPM

    Fragment   DealerID  ProviderID  ServiceMonth  Total  MinGain  Count(*)
    F_1 (M_1)  001       03          199904          400      400         1
    F_2 (M_2)  001       01          199904          350       50         2
               001       02          199904          500      200         2
    F_3 (M_3)  002       03          199904         1000      400         2
    F_4 (M_4)  none



3 The Proposed Partitioned View Maintenance 
Schemes (PVMS) 

The partitioned view maintenance scheme (PVMS) consists of a main algorithm 
that takes horizontally partitioned warehouse data cube views and the changes 
made to the fact table as input, and returns the updated horizontally partitioned 
cube views. The approach assumes that the fact table has been updated. In 
order to make warehouse data more available to users, the strategy of splitting 
the maintenance work into propagate and refresh functions [1, 4] is adopted in 
this algorithm. 

The two functions are called from the main algorithm
PartitionedView-Maintenance. The PartitionedView-Maintenance algorithm first
implements the Propagate function through the functions
aggregate-changes-for-topview and aggregate-changes-for-view. The formal
presentation of the main algorithm PartitionedView-Maintenance is given in
Figure 1, and further details on both Propagate and Refresh functions are
discussed next.

3.1 Propagate Function 

In algorithm PartitionedView-Maintenance, the called function
aggregate-changes-for-topview creates a net-change table for only the top view,
V_1, from changes made on the fact table. The net-change table is used later to
update the top view. In order to create the net-change table for the top view,
all changes on the fact table are expressed as insertion and deletion pairs.
Insertions are recorded in a table called Insertion-FT, and deletions are
recorded in another table called Deletion-FT. The two tables have the same
heading format as the fact table. Insertion-V_1 and Deletion-V_1, derived from
Insertion-FT and Deletion-FT respectively in the same way as the top view is
derived from the fact table, are created next. The union of Insertion-V_1 and
Deletion-V_1 forms a virtual view called Insert-Delete-V_1. The net-change
table Net-Change-V_1 for the top view is created by aggregating
Insert-Delete-V_1.

Algorithm 3.1 (PartitionedView-Maintenance - Maintaining horizontally
partitioned views)

Algorithm PartitionedView-Maintenance()
Input:  set of fragments F_m of horizontally partitioned data cube views V_i;
        for each view, changes C_j from fact table
Output: a set of updated horizontally partitioned data cube views
begin
    // 1. Propagate function: count the net changes for each partitioned view //
    Net-Change-V_1 = aggregate-changes-for-topview(V_1, C_j)  // computes for top view //
    // Creating the net-change table Net-Change-V_i for other views in the lattice //
    for i = 2 to number of views in the lattice
        Net-Change-V_i = aggregate-changes-for-view(view V_i, Net-Change-V_s)
    // 2. Refresh function: refresh the data cube partitioned views //
    for i = 1 to number of views in the lattice
        refreshed partitioned view V_i = Refresh-View(V_i, Net-Change-V_i)
end // of PVM //

Figure 1: The PartitionedView-Maintenance Algorithm

Function aggregate-changes-for-view is used to create a net-change table
called Net-Change-V_i for each data cube partitioned view V_i that is not the
top view.
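
The propagate step for the top view can be sketched in Python as below. The
sketch makes two assumptions: deleted fact rows are given with the gain of the
row being removed, and the group key is (DealerID, ProviderID, month) as in
the example of Section 2. The function name net_change is hypothetical, and
the data is a subset of the changes from that example.

```python
def net_change(insertions, deletions):
    """Aggregate inserted and deleted fact rows into one row per group.

    Each result row is (Total, MinGain, Count): inserted rows count +1 and
    add their gain, deleted rows count -1 and subtract their gain.
    """
    groups = {}
    for sign, rows in ((+1, insertions), (-1, deletions)):
        for dealer, provider, service_time, gain in rows:
            key = (dealer, provider, service_time[:6])
            total, min_gain, count = groups.get(key, (0, None, 0))
            min_gain = gain if min_gain is None else min(min_gain, gain)
            groups[key] = (total + sign * gain, min_gain, count + sign)
    return groups


# Hypothetical ServiceTime values; gains follow the example in Section 2.
insertion_ft = [("001", "03", "199904040010", 400)]
deletion_ft = [
    ("001", "02", "199904030090", 100),
    ("002", "01", "199904030070", 200),
    ("002", "01", "199904020049", 500),
]
changes = net_change(insertion_ft, deletion_ft)
```

Each entry of the result plays the role of one Active Tuple in the refresh
step that follows.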

3.2 Refresh Function 

Function Refresh-View is designed for refreshing each horizontally partitioned 
view in the data cube by applying changes in the corresponding net-change 
table computed from the propagate function. It takes all the fragments of a 
partitioned view and the corresponding net-change table as input and outputs 
the refreshed view partitions. The summary of steps involved in the function 
Refresh-View called by algorithm PartitionedView-Maintenance is given below.
For each Active Tuple in the net-change table, do the following: (1) find the
Needed Fragment for the Active Tuple; (2) find the Matched Tuple in the View
Fragment; (3) update or delete the Matched Tuple according to the information
in the Active Tuple. If there is no Matched Tuple, insert the Active Tuple
into the Needed Fragment. The sum of the count columns of the Active Tuple
(AT_i) and the Matched Tuple (T_j), expressed as AT_i.Count(*) and T_j.Count(*),
is used to decide whether the refresh deletes the Matched Tuple T_j or updates
the aggregate attribute values in T_j. If the sum is zero, the function deletes
T_j; otherwise, it updates the aggregate attribute values.
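
The refresh steps can be sketched in Python as follows. All names here
(refresh_view, find_fragment, recompute_min) are hypothetical; fragments are
modeled as dicts keyed by the group-by values, and recompute_min stands in for
whatever mechanism supplies a fresh minimum when the stored one may no longer
be valid.

```python
def find_fragment(key):
    """Needed Fragment under the assumed predicates DealerID <= 001, ProviderID > 02."""
    sp1, sp2 = int(key[0]) <= 1, int(key[1]) > 2
    return {(True, True): "F1", (True, False): "F2",
            (False, True): "F3", (False, False): "F4"}[(sp1, sp2)]


def refresh_view(fragments, net_change, recompute_min):
    """Apply every Active Tuple of the net-change table to its fragment."""
    for key, (d_total, d_min, d_count) in net_change.items():
        fragment = fragments[find_fragment(key)]      # (1) Needed Fragment
        matched = fragment.get(key)                   # (2) Matched Tuple
        if matched is None:
            fragment[key] = (d_total, d_min, d_count)  # new group: insert
            continue
        total, min_gain, count = matched
        if count + d_count == 0:
            del fragment[key]                         # no tuples left: delete
            continue
        if d_min <= min_gain:
            # The stored minimum may have been deleted; ask for a new one.
            min_gain = recompute_min(key)
        fragment[key] = (total + d_total, min_gain, count + d_count)
    return fragments


fragments = {
    "F1": {},
    "F2": {("001", "01", "199904"): (250, 50, 2),
           ("001", "02", "199904"): (600, 100, 3)},
    "F3": {("002", "03", "199904"): (1000, 400, 2)},
    "F4": {("002", "01", "199904"): (700, 200, 2)},
}
changes = {
    ("001", "03", "199904"): (400, 400, 1),
    ("001", "02", "199904"): (-100, 100, -1),
    ("002", "01", "199904"): (-700, 200, -2),
}
next_smallest = {("001", "02", "199904"): 200}  # stand-in for Min-DPM
refresh_view(fragments, changes, next_smallest.__getitem__)
```

Only the fragment returned by find_fragment is touched for each Active Tuple;
the other fragments are never scanned.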

3.3 Other Added Features of The PVM Algorithm 

Aggregate functions SUM and COUNT are self-maintainable with respect to
insertion and deletion. Therefore, the new value of SUM or COUNT can be
computed by adding AT_i.Sum(exp) or AT_i.Count(exp) and T_j.Sum(exp) or
T_j.Count(exp) together. MIN and MAX, however, are self-maintainable with
respect to insertion, but not to deletion. Thus, before updating the aggregate
attribute values of T_j, the Refresh function checks whether it is necessary to
re-compute the new minimum or maximum value. If Min(exp) of AT_i is less than
or equal to Min(exp) of T_j, or Max(exp) of AT_i is greater than or equal to
Max(exp) of T_j, the new value of Min(exp) or Max(exp) in T_j will need to be
re-computed from the base table, which is time consuming. To reduce the time for
re-computing the new minimum or maximum from base tables, the PVM algorithm
uses a smaller view called MinMax-View for each partitioned view containing the
aggregate function MIN or MAX. A MinMax-View contains the first k smallest
and/or the first k largest values from the changed base relations for each
group-by set of tuples in the corresponding view. The value of k is determined
by the data warehouse administrator according to the application.
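
A MinMax-View for MIN can be sketched in Python as below. The function names
build_min_view and min_after_delete are hypothetical, and the fallback to the
base table is only signalled by returning None rather than implemented.

```python
import heapq


def build_min_view(fact_rows, k):
    """Keep the k smallest gains per (DealerID, ProviderID, month) group."""
    gains_by_group = {}
    for dealer, provider, service_time, gain in fact_rows:
        key = (dealer, provider, service_time[:6])
        gains_by_group.setdefault(key, []).append(gain)
    return {key: heapq.nsmallest(k, gains)
            for key, gains in gains_by_group.items()}


def min_after_delete(min_view, key, deleted_gain):
    """Return the new minimum of a group after one fact row is deleted.

    None means the stored candidates are exhausted and the minimum has to
    be recomputed from the base table.
    """
    candidates = min_view[key]
    if deleted_gain in candidates:
        candidates.remove(deleted_gain)
    return candidates[0] if candidates else None


rows = [
    ("001", "02", "199904020048", 300),
    ("001", "02", "199904030080", 200),
    ("001", "02", "199904030090", 100),
]
view = build_min_view(rows, k=2)
key = ("001", "02", "199904")
first_min = min_after_delete(view, key, 100)
second_min = min_after_delete(view, key, 200)
```

A larger k postpones the fallback to the base table at the cost of a larger
auxiliary view, which is the trade-off left to the administrator.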



4 Performance Analysis 

Both the proposed PartitionedView-Maintenance algorithm (PVM-algorithm) and
the scheme in [4], called the MQM-algorithm, were implemented. The CPU execution
time (measured in seconds), for maintaining a partitioned view is taken as the 
only maintenance cost in the analysis. After some preliminary runs of both 
schemes on an Oracle 8 DBMS on a Pentium 400MHz IBM PC platform, it was 
seen that the performance of the proposed scheme was affected by both the size 
and the partitioning criteria of a view. Therefore, the following two experimen- 
tal study cases were designed. 

Case 1: Given a fixed size horizontally partitioned data warehouse view, differ- 
ent partitioning criteria were used to obtain different numbers of fragments for 
the view and both the PVM and MQM algorithms were compared. 

Case 2: Given a horizontally partitioned view with varying sizes and same par- 
titioning criterion, both PVM-algorithm and MQM-algorithm were run. 

Table 6 shows the CPU execution time of the PVM-algorithm for maintaining
view V under four different partitioning criteria, for refreshes involving
only insertion, deletion, or update operations. Since insertion takes a high
percentage of the maintenance time, the contribution of the PVM algorithm,
which drastically reduces this time, is tangible. Table 7 tabulates the CPU
execution time of the PVM-algorithm and the MQM-algorithm for a view with
four fragments.
Table 6: Maintenance Time for the 10,000-row View

                       CPU Execution Time (seconds)
    Type of Refresh  1 fragment  2 fragments  4 fragments  8 fragments
    Insertion            6966.5       3714.5       1837.0       1038.0
    Deletion             4512.8       2264.5       1233.7        672.4
    Update               4352.9       2570.0       1270.6        629.3

Table 7: Comparing CPU Maintenance Time for Varying View Sizes

    Size of View     2000     4000     6000     8000    10000
    PVM             374.2    615.8    931.7   1222.4   1538.9
    MQM           1015.03   1870.4   2912.4   4052.7   5280.8




5 Conclusions and Future Work 

This paper presents a scheme for maintaining horizontally partitioned data
warehouse cube views efficiently, since only the one relevant fragment of the
view is accessed for each maintenance operation rather than the entire view.
The proposed scheme handles all three types of refresh operations (deletion,
insertion and update), and includes a technique for reducing the refresh time
for maintaining views with the aggregate functions MIN and MAX. The experimental
improvement ratio of the maintenance time using the proposed maintenance scheme
in comparison to that of [4] is approximately 68% when the size of a view is
between 20,000 and 100,000 rows, and this ratio increases steadily with the
size of the view, as shown in Table 7.
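
The quoted improvement ratio can be checked against the CPU times of Table 7
with a few lines of Python. The definition used here, 1 - PVM/MQM, is an
assumption, since the text does not state the formula.

```python
# CPU maintenance times from Table 7, ordered by increasing view size.
pvm = [374.2, 615.8, 931.7, 1222.4, 1538.9]
mqm = [1015.03, 1870.4, 2912.4, 4052.7, 5280.8]

# Assumed definition: fraction of the MQM time saved by PVM.
ratios = [1 - p / m for p, m in zip(pvm, mqm)]
mean_ratio = sum(ratios) / len(ratios)
```

Under this definition the mean ratio rounds to 68%, and the individual ratios
grow with the view size.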

Future work includes conducting experimental studies to measure (1) the 
benefits of storing MinMax-Views, and (2) the utility of this approach in
comparison with indexing techniques.
Abstract: This paper discusses: (1) the challenges that the European Statistical
System (ESS) faces as the result of the recent appearance of phenomena such as 
the information society and the new economy, and (2) the extent to which new 
technological developments in data warehousing, knowledge discovery and 
extensive use of the internet can contribute to successfully meeting these 
challenges. Two specific issues are considered: the network nature of the ESS, 
and the new ways of producing statistics that reinforce the needs for research 
applied to statistics in domains such as data integration, distributed databases, 
EDI, automated data capture, analytical tools and dissemination. A historical 
overview is given of research activities financed by the European Commission 
as well as their relevance for DaWaK2000. A primary goal of this paper is to 
provide information about relevant research within the European Statistical 
System, and to invite the scientific community to participate actively in 
upcoming calls for proposals and calls for tender financed under the IST
programme to solve the urgent needs for timely and high-quality statistics. 



1 Introduction 

The challenges for the European Statistical System (ESS) have never been greater 
than they are at present. New phenomena, such as globalisation, the appearance of the 
information society, the effects of the “new economy” on employment and 
competitiveness, the creation of and the need for short-term indicators for the euro- 
zone, the on-going and up-coming negotiations for EU membership with the 
candidate countries, etc. have all increased the pressure on the ESS and in particular 
on Eurostat, to produce relevant, timely, and accurate statistical information. 

The ESS responds to these challenges with two types of actions: first, by
commissioning a large number of research projects (discussed in detail below); and
second, by taking concrete first steps towards a common IT policy, which has in the
past mainly been considered as a private matter of the National Statistical Institutes
(NSIs). Under the pressure of increasing user demands for more and faster statistics
(e.g. Internet, e-commerce, mobile phones) and given the difficulties of the traditional
approach of NSIs to respond promptly to these needs, however, this attitude has
changed. Given the similarity of concerns of all data providers, the necessity to
organise the ESS as a network, the common pressure of an extremely dynamic 
technological environment, the European integration process, the scarcity of resources 
and the complexity of the problems, a common IT policy for the ESS seems 
indispensable to all ESS partners. 
Fig. 1: The European Statistical System (ESS): a partnership between the
Commission and the 15 Member States

Various user satisfaction surveys have indicated that users demand higher quality. 
Data must be available on specific dates, to everyone. The Web is becoming the most 
natural way to access information, and information needs are broadening. Access to 
tabular data is no longer sufficient; the research community wants disaggregated
information. Common solutions could be considered, as illustrated by the interest of 
some Member States in the creation of Web sites and the implementation of data 
warehouses. NSIs are attempting to integrate their information systems (which, in the 
past, were primarily based on surveys), to make extensive use of administrative 
sources and to promote the use of standards in data exchanges. 

Much has already been achieved: common R&D projects co-financed by Community 
funds; a high level of involvement of the ESS in standardisation activities (GESMES,
CLASET); development of a common architecture to enable the distribution of
statistical services (DSIS); common tools to exchange information (Stadium, Statel); 
common developments in sectors where the European integration is well advanced 
(Edicom); and the launching of a programme for the transfer of technology and know- 
how. 

However, much remains to be done. In order to be prepared for future challenges,
more in-depth co-operation in the field of IT is indispensable. NSIs are independent
organisations with different traditions and cultures. Their sizes differ and they are part 
of administrative systems with a legacy. Eurostat has recently created an IT Steering 
Committee that will address these issues. It will co-ordinate activities in the 
traditional areas of R&D, metadata and reference environments, data exchange and 
transfer of technology and know-how. Efforts will continue to improve the
harmonisation of metadata. The areas of e-commerce, data warehousing and metadata 
have been identified as topics of high interest and will be treated with priority. 



2 The Statistical Production Process 



Official statistics are difficult to define in terms of statistical methods employed. For 
the purpose of this presentation, it is more useful to look at the statistical production 
process which can be roughly divided into four parts: (1) methods and concepts 
definition; (2) data collection; (3) data processing and analysis; and (4) dissemination 
to end-users. The advantage of this process view is that the needs of official statistics 
can be defined in terms of process quality, and statistical and methodological techniques 
can be seen as tools for achieving the production goals. The needs within each of these
phases are discussed below.

Methods and concepts definition: Although less significant for DaWaK, this phase 
is probably the most essential: without adequate and harmonised methods and 
definition of nomenclatures, units, variables, etc. no comparable statistics can be 
produced. Since official statistics were first collected, considerable efforts have been 
made to harmonise definitions at the world level. Nonetheless, a new conceptual 
approach is badly needed to describe the new information economy, its shape, 
structure and dynamics, the transition from the “old” economy and its implication in 
terms of technologies, impact on businesses, on the economy and on citizens. 

Data collection and data sources: The ESS is composed of Eurostat and the NSIs, 
with Eurostat being the focal point of pan-European statistical activities. Eurostat 
does not engage in data collection, as this is the job of the NSIs and their national 
partners. Eurostat’s role is to check and guarantee the quality of data, including its 
accuracy, comparability, timeliness, coverage, etc. In Working Party meetings, 
Eurostat negotiates guidelines with the representatives of the Member States, 
designed to achieve the indispensable harmonisation of data collection (e.g. 
harmonised questionnaires in all languages for all Member States) and data 
processing (e.g. common approach for data editing and imputation). 

Many countries have already experienced the trend towards indirect data acquisition 
in electronic form from administrative registers and distributed databases. 
Furthermore, direct data gathering through questionnaires and even interviews is 
increasingly computer-assisted and at least partially done in an EDI form. Finally, 
new data sources such as electronic commerce records and automated data capturing 
devices have emerged. The present need to cope with data from different sources 
raises the difficult question of the efficient combination of multiple data sources. The 
new techniques of data collection present interesting technical and analytical 
challenges. 

Data processing and analysis: Information technology and statistical techniques play 
a crucial role in the development of in-house processing of incoming data. Both 
nascent and well-developed statistical techniques have become feasible with the
computer age. In terms of production efficiency, there is a need to semi-automate the 
most time-consuming phases, such as data editing and imputation. Thus, there is a 
need to develop procedures based on techniques such as tree-based classification 
methods, graphical outlier detection, machine learning such as neural networks or 
genetic algorithms, etc. There are several specific areas, such as time-series analysis 
and metadata management, that require both harmonisation and commonly adopted 
modern methods. An emerging area known as knowledge discovery in databases 
(KDD) is likely to play a more prominent role in the analysis and data processing 
efforts of tomorrow. 

Dissemination: Official statistics are traditionally presented in the form of printed 
tables. However, modern electronic forms of dissemination, such as the Internet, are 
developing rapidly. Thus, there is both a need and a technical possibility to present 
statistical information in more detail, interactively and using appropriate visualisation 
tools. The needs expressed by the socio-economic research communities to access 
individual data to perform their analysis sometimes conflict with the protection of the 
confidentiality of the data and the respect of the privacy guaranteed by the statistical 
laws. Improved protection algorithms are needed. 



Fig. 2: The evolution. Panels: "The traditional way of producing statistics"
(business survey, household survey, labour force survey and other surveys, each
passing through concepts and methods definition, data processing and analysis,
and data dissemination) and "The new directions".


Evolution: Historically, the statistical process was conceived and organised around 
the concept of the survey. A need was identified, specific concepts and questionnaires 
were developed, and a survey of the specific statistical units was established. The 
data were processed and disseminated, without any special links to other sources of 
information. 

The development of computerised registers (e.g. enterprises, persons), the 
development of different administrative sources (e.g. tax authorities, employment 
authorities), the development of automated data collection (e.g. extraction from 
corporate accounting systems, satellite images, bar codes, the internet) made it 
possible to conceive a far more integrated approach, where a maximum of data is 
collected without the active participation of the responding unit. Specific surveys are 
seen as complementary information sources. 

This evolution raises a large number of questions (including ethical questions), but 
forces the statistical institutes to incorporate new tools and techniques for the 
collection, integration, and analysis of large databases for safe and efficient 
dissemination. 



3 The European Plan of Research in Official Statistics 

Over the last few years, Eurostat has developed, in co-operation with the NSIs, the 
European Plan of Research in Official Statistics (EPROS), which is an inventory of 
R&D needs and a guide for further developments. For a complete description see the 
Web site of the Virtual Institute for Research in Official Statistics (VIROS) at 
^ttpj^uropa^euinEe^com^^urosta^esearcMnteo^de^ht^ 

EPROS is presented following the NORIS classification (Nomenclature on Research 
in Official Statistics). The first digit classification reads: 

1. Methodological issues 

2. Advanced technologies for data collection 

3. Quality issues 

4. Data analysis and statistical modelling 

5. Multi-data sources, integration and systematisation 

6. Dissemination and disclosure control 

7. Other innovative applications of information technologies 

Some of the second digit headings are particularly interesting for DaWaK, such as 
2.3 Automatic data capture, 4.1 Data analysis and knowledge extraction, 5.2 Multi- 
source environments and 5.4 Distributed database management and analysis systems. 
The reader is referred to the original documents at the above-mentioned Web site for 
a complete NORIS presentation. Additionally, a number of documents explaining the 
scope of the various calls for proposals can be found. 



4 Research in Statistics through the former EU Framework 
Programmes 

The present paradigm of official statistics is a highly complex production process which 
must achieve goals of timeliness, accuracy and relevance. Major investment into R&D 
is required, but the burden is often too heavy for most NSIs. Thus, a critical mass of 
partners has to be achieved. As most R&D in official statistics is international by 
definition, the results should be useful both at national and at European level. The role 
of Eurostat in the co-ordination of research is to achieve the new production model at 
the national level as well as the European level, which brings a further advantage: that 
of achieving harmonisation and international comparability. 

To face these challenges, Eurostat has promoted the development and the use of new 
methods and tools since 1989. It has also reinforced the co-operation between the ESS 
and the scientific communities. The main mechanisms to implement this activity have 
been the Community R&D Framework Programmes, in which R&D in statistics has 
steadily grown in importance. Eurostat also has a strong tradition in the organisation 
of scientific conferences (e.g. KESDA 98, NTTS 98, ETK 99) and Workshops (e.g. 
Statistical Metadata, Confidentiality), in which areas of particular relevance for 
Official Statistics are discussed together with academics and representatives of the 
private sector. Over the last years, the main research programmes supporting the ESS 
have been the DOSES, SUPCOM and DOSIS programmes. Currently, the EPROS 
programme is being implemented. 



Fig. 3: R&D amounts and periods 

SUPCOM: Since 1995, Eurostat has benefited from a specific research budget line of 
the 4th Framework Programme entitled “Research Support Activities on a competitive 
basis”. This programme was created to fulfil the research needs of the Commission 
services, and has been used by Eurostat to provide and disseminate existing 
information and to harmonise the system of data collection and analysis which was in 
place in the European Union. 

Eurostat used this opportunity to launch 93 R&D studies (for a total of 13.7 M) on a 
wide variety of statistical topics, of which 22 were awarded to NSIs. More than two- 
thirds of these projects have been completed, and the results are progressively made 
available on the Commission’s Website. Overall, half of the SUPCOM projects 
concentrated on methodological issues, dealing with harmonisation and comparability 
(often applied to social statistics), environment, transport and business. Another well- 
covered topic is the adaptation of the statistical system to the new IT realities (e.g. use 
of the internet, new graphical tools, computer assisted training). Emerging phenomena 
and techniques were also explored (e-commerce, neural networks, etc.). The first 
results of an on-going external evaluation demonstrate that 90% of the projects 
evaluated thus far contribute to the work of Eurostat. 

A number of projects fall within the domain of this conference, of which the most 
interesting are: Data Fusion in Official Statistics, Design of an Integrated Statistical 
Meta-information System and Creation of a CD-ROM on Metadata for Use in 
National Statistical Institutes, Databases and Metadata in Reference Systems, and 
Meta-Databases for Statistical Information. Further information on these and related 
projects can be found at the VIROS Web site. 

DOSIS: The DOSIS (Development of Statistical Information Systems) projects 
provide the statistics component of the 4th Framework Programme (1995-1998) with a 
strong emphasis on the application of results geared to improving the competitive 
position of European industry and the efficiency of NSIs. It was also intended that the 
Framework Programme would encourage co-operation through the creation of 
multinational consortia drawn from academia, government and the private sector. 
Eighteen projects were co-financed, for a total budget of 17.6 M. 

A number of these projects, some of which have already been completed and most of 
which will come to an end in the course of this year, are within the scope of DaWaK. 
ADDSIA investigates distributed database techniques and the World Wide Web 
technology to improve access to statistical data for the European research and policy 
community. AUTIMP will develop automatic imputation software for business 
surveys and population censuses. TADEQ will develop a prototype of a tool capable 
of documenting and analysing the contents and structures of electronic questionnaires. 
SDC addressed the problem of disclosure control of confidential data. Both 
aggregated data tables, when a unit takes a high proportion of the cell, and microdata, 
when a record can be identified due to some of its public characteristics, are tackled. 

Research in statistical metadata is of particular relevance for the ESS. Several 
standards exist: international methodologies and classification systems (national 
accounts, activity or products nomenclatures, etc.); general guidelines for statistical 
metadata on the Internet published by UN/ECE; general and special data 
dissemination standards from the IMF; and the OECD standard for Main Economic 
Indicators. Eurostat is participating in this research with the long-term goal to 
construct an integrated European metadata reference system for statistical data. At the 
same time, there is a growing need to develop systems for storing and harmonising 
metadata, without waiting for the various longer term solutions to become 
operational. Related problems were discussed during the Workshop on Statistical 
Metadata last February. The following broad priorities were identified: (1) a greater 
collaboration with international statistical agencies for the harmonisation of metadata 
content requirements; (2) the setup of networks of experts for developing and 
implementing specific tools for the management of metadata; and (3) promoting the 
use of existing metadata standards and tools. Two DOSIS projects have been funded. 
To prevent the provision of metadata to users of statistics from becoming a separate 
documentation process that is very time and resource consuming, IMIM will develop 
a metadata management system that is integrated with the whole production process. 
In addition, IDARESA will design and implement an integrated statistical data 
processing environment making use of meta-information methodology that is aimed 
at storage, access and dissemination of statistical information. 

Furthermore, there are a number of projects dedicated to knowledge extraction and 
symbolic data analysis in Official Statistics. The goal of the KESO project is to 
develop a prototype knowledge extraction or data mining system for identifying 
patterns and relationships in statistical datasets. BAKE will develop an environment 
able to extract Bayesian Belief Networks from databases. The first step will be to 
investigate the feasibility, relevance and applicability of the approach to official 
statistics by selecting 10 problems provided by NSIs. Finally, SODAS has developed 
techniques and software to facilitate the use of symbolic data analysis in NSIs. These 
three projects testify to the strong interest of the ESS in knowledge extraction 
research. Moreover, Eurostat organised a well-attended conference, KESDA 98, on 
Knowledge Extraction and Symbolic Data Analysis (see [1] for more information). 

Finally, the increased access to large, complex datasets coming from distributed 
databases necessitates the availability of intelligent visualisation tools that can be 
tailored to user needs. IVISS will develop an interactive visualisation system that can 
be expanded and customised, and will help the user in the analysis of complex data. 

At the time of writing this paper, half of the DOSIS projects have been completed. 



5 Research in Statistics through the 5th Framework Programme 

The fifth Framework Programme, which was initiated in 1999, is the first to explicitly 
recognise the importance of statistics for the information society. Research in 
statistics was included as a specific action line after consultation with policy makers, 
NSIs, research communities, and private sector companies that are the users and the 
developers of new tools, methods and concepts. In response to the challenges that the 
ESS faces, the EPROS plan has been drafted according to the following principles: 
that projects should have clear and appreciable R&D content; reflect the needs of the 
ESS and other producers/users of statistics; make use of new developments in 
technology; have a clear exploitation plan; and rest on cross-national and 
multidisciplinary partnerships. The implementation of EPROS involves the following 
groups of actors: official and academic statisticians with theoretical and domain 
expertise; computer scientists and information managers; other professionals, e.g., 
economists, in multidisciplinary teams; users, producers and providers of data and 
information technologies. Five key elements of EPROS are briefly described below: 

Research & Development into generic statistical tools, techniques, methodologies 
and technologies: This covers the whole statistical production process. Hence, it 
encompasses the definition of concepts and methods through to the analysis and 
dissemination of statistics, as defined in the NORIS (Nomenclature of Research in Official 
Statistics) classification. There is particular emphasis on achieving the highest standards of 
data quality, and on automating data collection, capture, interchange, editing, imputation 
and dissemination. 

Statistical Applications: the aim here is the use of statistical tools, techniques, 
methods and technologies in support of domain research activities in other parts of the 
research programmes. 

Statistical Indicators for the New Economy (SINE): recent trends in Information 
and Communications Technology (ICT) together with the evolution of globalisation 
have led to the emergence of the “New Economy”. The long-term structural shift 
from the industrial economy towards the services and the information economy must 
be monitored by statistical indicators. It is likely that conventional concepts and 
classifications will be inappropriate. Hence, there should be considerable emphasis on 
innovative, forward-looking indicators relating to emerging activities. 

Transfer of technology and know-how initiative (TTK): a key objective of the 
R&D activities is to achieve transfer of technology and know-how in the ESS and 
other relevant organisations and institutions that could benefit from the knowledge 
and tools developed within EU research, either steered by the Commission, by 
national R&D programmes or by NSIs themselves. 

Supporting activities: other complementary measures are foreseen to prepare, extend 
or disseminate the R&D and TTK projects. 

In 1999, under the IST work programme CPA4 (Cross Programme Action 4), two calls for 
proposals for new indicators and statistical methods were issued. Among others, 
proposals were invited “to develop and demonstrate new methods and tools for the 
collection, interchange and harmonisation of statistical data, ...”. In 2000, under CPA8 
(Cross Programme Action 8), foci have been defined, for instance on “statistical data 
mining, . . . the use of administrative data for statistical purposes, . . . improvements in 
quality and in timely and low-cost data production . . .”. 

Thus, three calls for proposals have been issued to date. At the time of writing this 
summary, only the results of the first call are available and final. Thirteen research 
proposals of the forty-one proposals submitted to the IST Programme were 
eventually funded. In the second call, which is dedicated to research on Statistical 
Indicators of the New Economy (SINE), six proposals have been evaluated positively 
and the Commission will soon begin negotiating with the consortia. With respect to 
the third call, which was again open for methods and tools, applications and 
accompanying measures, some fourteen proposals have proceeded into the second 
round. A final decision is expected by the end of this year. 

The first projects funded under the 5th Framework Programme began in 2000. A large 
range of research activities is covered: improvement of classifications, automated 
questionnaires, ex-post harmonisation techniques, editing and imputation, business 
cycle analysis and metadata (with a specific interest in standardisation). The 
statistical indicators of the new economy are addressed via the description of the 
content of the Internet, the tracking of e-commerce, new indicators for the knowledge 
economy and for innovation through patent databases etc. 

From the first call, a number of funded projects are again closely related to the research 
activities of DaWaK. EUREDIT will evaluate and compare the current methods for 
editing and imputation to establish current best practice methods. In addition, new 
methods for editing and imputation based on neural networks, fuzzy logic 
methodology and robust statistical methods will be developed and compared with the 
current best practices. IQML will provide the solution for the collection of timely 
statistics to support regional, national and Community policy making, whilst at the 
same time reducing the reporting burden on enterprises. The technical objective is to 
understand how to develop an intelligent questionnaire that can interrogate company 
database metadata and questionnaire metadata in order to extract the required data 
automatically. MISSION will provide a software solution that will utilise the 
advances in statistical techniques for data harmonisation, the emergence of agent 
technology, the availability of standards for exchanging metadata and the power of 
Internet information retrieval tools. The result will be a modular software suite aimed 
at enabling providers of official statistics to publish data in a unified framework, and 
allowing users to share methodologies for comparative analysis and harmonisation. 
The modules will be distributed over the Web and will communicate via agents. 
METAWARE will develop a standard metadata repository for data warehouses and 
standard interfaces to exchange metadata between data warehouses and the basic 
statistical production system. The aim is to make statistical data warehouse 
technologies more user-friendly for public sector users. The project will build upon 
the results of the 4th Framework Programme (in particular the IMIM and IDARESA 
projects) and develop object-oriented metadata standards for statistical data 
warehouses. 



6 The future Framework Programmes 

As the Framework Programme is implemented through annual work programmes, it is 
currently too early to detail the activities of the year 2001. However, the main 
intention for the next two years is to continue the implementation of EPROS, whilst 
taking into account any newly emerging domains. 

One initiative that could strongly influence next year’s work programme is the recent 
eEurope initiative of the European Union. This initiative was launched by the 
European Commission in December 1999 with the objective to bring Europe on-line, 
by promoting a cheaper, faster and more secure internet, by investing in people and 
skills, and by stimulating the use of the internet. At the recent Council summit in 
Lisbon, the Heads of State and Government committed themselves to a number of 
measures, including a target date to bring eEurope forward. Ten areas were identified 
for which action on a European level would add value. The ESS should be associated 
with the collection of the data needed for the benchmarking exercise. Part of the 
funding for this activity might come from the IST programme. The benchmarking 
exercise will be necessary to ensure that actions are carried out efficiently, have the 
intended impact and achieve the required high profile in all Member States. Furthermore, 
the next Framework Programme (FP6) is currently under preparation. Eurostat 
intends to continue the R&D activities in line with the EPROS document. 
Consultation will take place with the research communities, the NSIs, and the users in 
order better to understand the needs and technological opportunities for future 
research. 



7 Conclusions 

Overall, since 1996, the R&D activity undertaken or supported by Eurostat has been 
considerable. In the last five years, 55 M have been spent on 130 projects, involving 
more than 350 research teams covering all fifteen Member States. NSIs from twelve 
Member States participate in 96 research projects. The aims of promoting research in 
the ESS, of reinforcing the research potential and of strengthening relations between 
official and academic statisticians are being gradually achieved. 

Eurostat has commissioned an independent evaluation of the research projects funded 
under the SUPCOM initiative. In the panel discussion of last year’s ETK Conference 
(Exchange of Technology and Know-how) in Prague, it became clear that better 
mechanisms are needed to transfer the research results into the daily work of the 
NSIs. Research teams very often develop only prototypes. Effective technology 
transfer requires further analysis and major adaptations, something that the research 
teams will not and the NSIs usually cannot deliver. To address these issues, and to co- 
ordinate the transfer of technology and know-how, Eurostat has set up, jointly with 
the Joint Research Center in Ispra, the European Statistical Laboratory. 



Abstract. On-line analytical processing provides multidimensional data 
analysis, through extensive computation based on aggregation, along 
many dimensions and hierarchies. To accelerate query-response time, 
pre-computed results are often stored for later retrieval. This adds a 
prohibitive storage overhead when applied to the whole set of aggre- 
gates. In this paper we describe a novel approach which provides the 
means for the efficient selection, computation and storage of multidi- 
mensional aggregates. The approach identifies redundant aggregates, by 
inspection, thus allowing only distinct aggregates to be computed and 
stored. We propose extensions to relational theory and also present new 
algorithms for implementing the approach, providing a solution which is 
both scalable and low in complexity. The experiments were conducted us- 
ing real and synthetic datasets and demonstrate that significant savings 
in computation time and storage space can be achieved when redundant 
aggregates are eliminated. Savings have also been shown to increase as 
dimensionality increases. Finally, the implications of this work affect the 
indexing and maintenance of views and the user interface. 



1 Introduction 

On-line analytical processing (OLAP) provides multidimensional data analysis 
to support decision making. OLAP queries require extensive computation, involving 
aggregations along many dimensions and hierarchies [Codd93]. The time 
required to process these queries has prevented the interactive analysis of large 
databases [CD97]. To accelerate query-response time, pre-computed results are 
often stored as materialised views for later retrieval. The full materialisation 
relates to the whole set of possible aggregates, known as the data cube. In this 
case the additional overhead is orders of magnitude larger than the input dataset 
and thus very expensive in terms of both the pre-computation time and storage 
space. The challenge in implementing the data cube has been to select the min- 
imum number of views for materialisation, while retaining fast query response 
time. In this paper, we propose a new approach which identifies and eliminates 
redundancy in the data representation of multidimensional aggregates, as first 
described in [Kot00]. The approach identifies redundant views and then computes 
and stores only the distinct or non-redundant ones. The whole set of aggregates 
can, however, be retrieved later, from the selected subset, in the querying process 
without any processing cost. The algorithm which identifies the redundant views 
requires approximately 10% of the time required to compute the data cube con- 
ventionally. As the selected subset is smaller than the full data cube, the overall 
computation is considerably faster than the conventional method described by 
[GBLP96]. Further contributions of this paper are: 

— Novel extensions to relational theory with regard to redundancy in an OLAP 
context 

— New algorithms for the identification and elimination of redundant views 

— An extensive set of experimental results confirming the theory by empirical 
measurements with the goal of demonstrating fairly the practicability of the 
new approach 

The paper is organised as follows: Section 2 describes the fundamental as- 
pects of the OLAP model. Section 3 discusses related work in implementing the 
data cube. Section 4 describes the Totally-Redundant Views. Section 5 presents 
the Key-algorithm for the identification of redundant views. Section 6 describes 
the improved implementation of the data cube. Section 7 presents the results 
from the experiments conducted in this work. Finally, Section 8 presents the 
conclusion. 

2 The Multidimensional Model 

The typical organization of data in a relational OLAP system is based on the 
star schema design with a fact table and several dimension tables [Kim96] . For 
example, a fact table could be the Sales relation with the following schema: 

Sales(ProductID, LocationID, TimeID, Sales) 

and the following dimension tables: 

Product(ProductID, Type, Category) 

Location(LocationID, City, Country, Continent) 

Time(TimeID, Month, Year) 

Each attribute of the Sales relation may be classified either as a dimension 
or a measure of interest. In this example, the measure of interest is Sales and 
the dimensions are Product, Location and Time. 

The operations involved in OLAP queries are typically aggregates in one or 
more dimensions. A given measure in n dimensions gives rise to $2^n$ possible 
combinations. In the table Sales above, the following aggregations are possible: 
(Product-Location), (Product-Time), (Location-Time), (Product-Location-Time), 
(Product), (Location), (Time), (All). 

Gray et al. [GBLP96] introduced the Cube-by operator as the n-dimensional generalisation 
of simple aggregate functions. It computes every group-by corresponding to all 
possible combinations of a list of dimensions. In the previous example the $2^n$ 
possible combinations can be expressed by one SQL statement, e.g. 

SELECT S.Product, S.Location, S.Time, SUM(S.Sales) AS Sales 
FROM Sales S 
CUBE-BY S.Product, S.Location, S.Time 

The benefit of the Cube-By operator is that the user is no longer required to 
issue all possible group-by statements explicitly. The impact is greater when 
multiple dimensions with hierarchies are involved, resulting in thousands of ex- 
plicit Group-by statements. In the next section we will discuss existing methods 
for the implementation of the cube. 
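
As an illustration of what the Cube-By operator computes, the following minimal sketch enumerates all $2^n$ group-bys for the Sales example in plain Python. The function name `cube` and the three sample rows are assumptions made for this illustration only; the sketch is a naive computation, not one of the optimised algorithms discussed in the next section.

```python
from collections import defaultdict
from itertools import combinations


def cube(rows, dimensions, measure):
    """Compute SUM(measure) for every subset of `dimensions` (2^n group-bys)."""
    result = {}
    for k in range(len(dimensions), -1, -1):
        for group in combinations(dimensions, k):
            totals = defaultdict(int)
            for row in rows:
                # The grouping key is the tuple of values of the chosen dimensions;
                # the empty tuple corresponds to the (All) aggregate.
                key = tuple(row[d] for d in group)
                totals[key] += row[measure]
            result[group] = dict(totals)
    return result


# Hypothetical sample data for the Sales fact table.
sales = [
    {"Product": "P1", "Location": "L1", "Time": "T1", "Sales": 10},
    {"Product": "P1", "Location": "L2", "Time": "T1", "Sales": 20},
    {"Product": "P2", "Location": "L1", "Time": "T2", "Sales": 30},
]

views = cube(sales, ("Product", "Location", "Time"), "Sales")
print(len(views))           # 8 group-bys for n = 3
print(views[()])            # {(): 60}
print(views[("Product",)])  # {('P1',): 30, ('P2',): 30}
```

Each group-by here rescans the input, which is exactly the cost that the optimisations cited below (smallest parent, shared sorts, and so on) aim to avoid.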

3 Implementing the Data Cube 

Several approaches have been proposed for implementing the data cube which 
differ from the approach proposed in this paper. To the best of our knowledge, 
there is no work which has considered identification and elimination of redundancy 
in the data representation of the multidimensional aggregates for the 
efficient implementation of the data cube. However, our approach is compatible 
with previous approaches and may be integrated with them. The computation of 
the cube has been studied by [DANR96], [SAG96], [ZDN97], [RS97] and [BR99]. 
Several optimisations which have been proposed by [GBLP96], [AAD+96] and 
[SAG96] are: smallest parent, cache results, amortise scans, share sorts and share 
partitions. In the selection of the materialised views, work has been presented 
by [HRU96], [BPT97], [Gup97] and [SDN98]. [HRU96] presents the Greedy algorithm 
that uses the benefit per unit space of an aggregate. The inputs to the 
algorithm are the amount of space available for pre-computation and a set ini- 
tially containing all aggregates in the lattice, except the raw data. The output is 
the set of aggregates to be materialised. The authors [HRU96] have shown that 
the benefit of the Greedy algorithm is at least 63% of the optimal algorithm. 
[SDN98] has evaluated the Greedy algorithm and shown that it needs a prohibitive 
amount of processing. The algorithm PBS (Pick By Size) is proposed 
as an improvement of the Greedy algorithm [HRU96]. PBS picks aggregates for 
pre-computation in increasing order of their size. The algorithm’s performance is 
the same as the greedy algorithm but only requires a fraction of the time needed 
by the Greedy Algorithm. [BPT97] presents a heuristic for the identification 
of approximately equally sized group-bys. However their technique is based on 
size estimation - which is restricted to uniform distributions - and also requires 
knowledge of functional dependencies. The authors of [BPT97] have also ob- 
served that the number of representative queries is extremely small with respect 
to the total number of elements of the complete data-cube. Thus the indication 
of the relevant queries is exploited to guide the selection of the candidate views, 
i.e., the views that, if materialised, may yield a reduction in the total cost. Recently, 
[BR99] introduced the Iceberg-Cube as a reformulation of the data cube 
problem to selectively compute only those partitions that satisfy a user-specified 
aggregate condition defined by a selectivity predicate (HAVING clause). The 
authors of [BR99] claim that this optimisation improves computation by 40%. 
However, this approach is inherently limited by the level of aggregation specified 
by the selectivity predicate HAVING-COUNT(*). Compressed methods have 
also been proposed for the materialised views by [OG95], [WB98] and recently 
[KM99], and approximate methods by [BS98] and [VW99]. 
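
The PBS selection rule mentioned above (pick aggregates in increasing order of size until the space available for pre-computation is used up) can be sketched as follows. This is a simplified reading of the rule as summarised in this section, not the implementation of [SDN98]; the function name `pick_by_size` and the view sizes are hypothetical.

```python
def pick_by_size(view_sizes, space_budget):
    """Select views in increasing order of size until the budget is exhausted."""
    selected, used = [], 0
    for view, size in sorted(view_sizes.items(), key=lambda item: item[1]):
        if used + size > space_budget:
            # Views are visited smallest first, so no later view can fit either.
            break
        selected.append(view)
        used += size
    return selected, used


# Hypothetical estimated sizes (in tuples) of candidate aggregates.
view_sizes = {
    ("Product",): 50,
    ("Location",): 20,
    ("Time",): 12,
    ("Product", "Location"): 400,
    ("Product", "Time"): 300,
    ("Location", "Time"): 150,
}

selected, used = pick_by_size(view_sizes, space_budget=250)
print(selected)  # [('Time',), ('Location',), ('Product',), ('Location', 'Time')]
print(used)      # 232
```

The sketch needs only one sort of the candidate views, which is consistent with the observation that PBS requires a fraction of the time of the Greedy algorithm.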

4 Totally-Redundant Views 

Totally-Redundant (or T-R) views are the views derivable from their parent 
input relations which are already computed and stored as distinct aggregates. 
The T-R views are thus virtual representations of their parent input relations 
and they may be derived from them by a simple projection¹. Their identification 
is based on the g-equivalent property as explained in Definition 4 below. 
Table 1 introduces notation which will be used to discuss the theory of Totally- 
Redundant views as follows: 



Notation      Description 
R             Input (or parent relation of R') 
R'            Aggregation (or child relation of R) 
$t_R$         Tuple in R 
$t_{R'}$      Tuple in R' 
$C_R$         Cardinality of R 
$C_{R'}$      Cardinality of R' 
$S_t$         Set of grouping attributes in $t_R$ 
$S_{t'}$      Set of grouping attributes in $t_{R'}$ 
$M_R$         Measure of interest in R 
$M_{R'}$      Measure of interest in R' 

Table 1. Notation of main components 



In line with Codd [Codd70], when a domain (or combination of domains) of a 
given relation has values which uniquely identify each element (n-tuple) of that 
relation it is called a candidate key. We classify candidate keys into two types: 
Definitional and Observational keys. 

Definition 1. Definitional Keys are those keys which are defined as part of the 
database schema (e.g., by the database designer). 

Definition 2. Observational Keys are those keys which are defined empirically. 

Thus an Observational key is invariant in a read-only database but may be 
modified by updates to the dataset. A Definitional key always possesses a unique 
identification property despite updates. 
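
To make the notion of an Observational key concrete, the sketch below tests empirically whether a set of attributes uniquely identifies every tuple of the current dataset. The helper name `is_observational_key` and the sample rows are assumptions introduced for illustration.

```python
def is_observational_key(rows, attributes):
    """Return True if the values of `attributes` are unique across the current rows."""
    seen = set()
    for row in rows:
        key = tuple(row[a] for a in attributes)
        if key in seen:
            return False  # two tuples share the same values, so this is not a key
        seen.add(key)
    return True


# Hypothetical contents of the Sales relation.
sales = [
    {"Product": "P1", "Location": "L1", "Time": "T1", "Sales": 10},
    {"Product": "P1", "Location": "L2", "Time": "T1", "Sales": 20},
    {"Product": "P2", "Location": "L1", "Time": "T2", "Sales": 30},
]

print(is_observational_key(sales, ("Location", "Time")))  # True
print(is_observational_key(sales, ("Product",)))          # False

# An update can invalidate an Observational key.
sales.append({"Product": "P3", "Location": "L1", "Time": "T1", "Sales": 5})
print(is_observational_key(sales, ("Location", "Time")))  # False
```

The last call shows why such a key is only guaranteed to hold in a read-only database: a single inserted tuple is enough to remove the uniqueness property.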

¹ Simple projection refers to a projection which does not require duplicate elimination. 





150 N. Kotsis and D.R. McGregor 



Definition 3. A tuple $t_{R'}$ is defined to be group-by-equivalent or g-equivalent 
($\equiv_g$) to a tuple $t_R$ if, and only if, the set of grouping attributes $S_{t'}$ is a subset 
of the set of grouping attributes $S_t$ and the measure of interest is equal for both 
R' and R (see Figure 1). 

$t_{R'} \equiv_g t_R$ iff $S_{t'} \subset S_t$ and $M_{R'} = M_R$ 



R'(P2, L1, 30) $\equiv_g$ R(P2, L1, T2, 30) 

Fig. 1. The g-equivalent tuple 



Definition 4. A relation R' is defined to be g-equivalent ($\equiv_g$) to a relation R 
if, and only if, for every tuple in R there is a g-equivalent corresponding tuple 
in R' and both relations have the same cardinality. 

$R' \equiv_g R$ iff ($\forall\, t_R \; \exists\, t_{R'}$ such that $t_{R'} \equiv_g t_R$) and $C_{R'} = C_R$ 
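
The two definitions above can be checked mechanically. The sketch below is one possible reading, under the assumption (suggested by Figure 1) that the grouping attributes of the child tuple must also carry the same values as in the parent tuple. Tuples are modelled as dictionaries, and the function names are hypothetical.

```python
def g_equivalent_tuple(child, parent, measure):
    """Definition 3: child's grouping attributes are contained in the parent's
    (with matching values) and both tuples carry the same measure."""
    child_groups = {(a, v) for a, v in child.items() if a != measure}
    parent_groups = {(a, v) for a, v in parent.items() if a != measure}
    return child_groups <= parent_groups and child[measure] == parent[measure]


def g_equivalent_relation(child_rel, parent_rel, measure):
    """Definition 4: equal cardinality, and every parent tuple has a
    g-equivalent tuple in the child relation."""
    if len(child_rel) != len(parent_rel):
        return False
    return all(
        any(g_equivalent_tuple(c, p, measure) for c in child_rel)
        for p in parent_rel
    )


# The tuples of Figure 1.
parent = [{"Product": "P2", "Location": "L1", "Time": "T2", "Sales": 30}]
child = [{"Product": "P2", "Location": "L1", "Sales": 30}]

print(g_equivalent_tuple(child[0], parent[0], "Sales"))  # True
print(g_equivalent_relation(child, parent, "Sales"))     # True
```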



4.1 Extending the Relational Theory: Totally-Redundant Views

Theorem 1. When the result relation R' of an aggregation has the same
cardinality as the parent relation R then each tuple t_R' is g-equivalent to the
corresponding tuple t_R, both in its grouping attributes and in its measure of
interest.

Proof In an aggregation operation, each tuple t_R' is derived either from a single
tuple or from several tuples of the input relation R. If a tuple t_R' is a result of
several tuples in R then there is a reduction in cardinality of the relation R'
relative to the relation R. Thus, if the cardinality of R' and R is the same,
then each tuple t_R' must have been derived from only a single tuple t_R, and
hence must be g-equivalent to the corresponding tuple of R in both its projected
dimensional values and its measure of interest.

Theorem 2. Any aggregation of a relation R over any set of domains which
includes a Candidate Key produces a result relation R' in which each resulting
tuple must be g-equivalent to the corresponding tuple of R in both its grouping
attributes and its measure of interest.

Proof Any projection or aggregation of R that includes a candidate key
preserves the same number of tuples. Thus, any aggregation or projection which
includes a candidate key of R produces a result relation R' with the same
cardinality as R. Thus (by Theorem 1), each tuple in R' must be identical to
the corresponding tuple of R in both its projected dimensional values and its
measure of interest.

Theorem 3. (Converse of Theorem 1): When an aggregation or projection of
a parent relation R over a set of domains produces a result relation R' with
the same cardinality as R, then that set of domains contains an Observational
candidate key of both R and R'.

Proof If an aggregation or projection of a parent relation R produces a
resulting relation R' with the same cardinality, then the dimensions over which
the aggregation has been carried out must uniquely distinguish each tuple of R.
Each resulting tuple must have been derived from a single parent tuple (if this
were not so, then aggregation must have occurred, with a resultant reduction in
the cardinality of R'). Hence if the cardinalities of R' and R are the same, the
dimensions of the aggregation must include an Observational candidate key of R.

4.2 Example of Totally-Redundant Views 

Figure 2 shows the input relation R and the g-equivalent aggregate relation R' . 
R and R' have equal cardinality and for every tuple in R' there is a g-equivalent 
corresponding tuple in R. The aggregate relation R' is redundant since it can 
always be produced by a simple projection of R. 



R
Product   Location   Time   Total_Sales
P1        L1         T1     80
P2        L3         T4     20
P1        L2         T3     50
P4        L1         T1     30
P3        L1         T3     80
P4        L3         T2     100
P1        L3         T1     45
P3        L2         T3     70
P2        L1         T2     30

R'
Product   Location   Total_Sales
P1        L1         80
P2        L3         20
P1        L2         50
P4        L1         30
P3        L1         80
P4        L3         100
P1        L3         45
P3        L2         70
P2        L1         30

Fig. 2. The Input Relation R and its g-equivalent aggregation R'
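
The example of Figure 2 can be checked mechanically. The following is a minimal Python sketch, not the authors' implementation: it assumes SUM as the aggregate function, and the helper names `aggregate` and `is_g_equivalent` are hypothetical. It groups R by (Product, Location), confirms that the cardinality is preserved, and confirms that the aggregate coincides with a simple projection of R.

```python
from collections import defaultdict

# Input relation R from Figure 2: (Product, Location, Time, Total_Sales).
R = [
    ("P1", "L1", "T1", 80), ("P2", "L3", "T4", 20), ("P1", "L2", "T3", 50),
    ("P4", "L1", "T1", 30), ("P3", "L1", "T3", 80), ("P4", "L3", "T2", 100),
    ("P1", "L3", "T1", 45), ("P3", "L2", "T3", 70), ("P2", "L1", "T2", 30),
]


def aggregate(rows, dims):
    """SUM the last column, grouped by the column positions in dims."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[-1]
    return dict(totals)


def is_g_equivalent(rows, dims):
    """True when the group-by keeps the cardinality of rows (Theorem 1)."""
    return len(aggregate(rows, dims)) == len(rows)


# Group-by over (Product, Location) versus a simple projection of R.
by_product_location = aggregate(R, (0, 1))
projection = {(p, l): sales for p, l, _t, sales in R}
```

Here (Product, Location) is an Observational key of R, so the group-by and the projection hold the same tuples; grouping by Product alone merges tuples and is therefore a genuine aggregate.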



4.3 Utilizing the Totally-Redundant views in the Implementation 
of the Data cube 

Utilising Theorem 3.1 avoids the explicit computation and separate storage of 
any aggregate which includes a key. This not only has a major impact on the 
computation time, but also reduces the storage requirements of the materialised 
views. The T-R approach proceeds in two stages: 

- Stage 1: Identification of T-R views by determining the set of Observational
  keys.

- Stage 2: Fast computation of the data cube by avoiding the processing and
  storage of T-R views.

5 Stage 1: The Key Algorithm

Briefly stated, the approach empirically determines all the Observational
keys^ present in a given parent relation prior to the materialisation of
the data cube. The algorithm examines all possible aggregates in the data cube
and classifies each as either Totally-Redundant or not. It first checks whether
each group-by includes one of the already detected keys; if so, the group-by
can be categorised immediately as g-equivalent to the input relation and hence
Totally-Redundant. Of the remaining group-bys, those whose maximum size is
smaller than the size of the input relation cannot be candidates for the
equivalence property. The other aggregates are tested, to see whether any two
tuples of the input relation are combined during aggregation, using the First
Duplicate Detector (FDD) routine described later in this section. The Key
algorithm proceeds from the minimum to the maximum arity and uses the FDD to
scan each group-by until it detects the first duplicate tuple. When a duplicate
is found, the current group-by is an aggregate (not a g-equivalent relation) and
hence, according to Theorem 3.1, its schema is not a key. If there are no
duplicates, then the schema of the group-by is an Observational key. With the
discovery of a key, the algorithm eliminates from further consideration all
subsequent group-bys of greater arity that include the key in their schemas.
Such group-bys are thus also Totally-Redundant views.

5.1 The Complexity and Performance of the Key Algorithm 

Given a relation $R(A_1, A_2, \ldots, A_n)$ with n dimensions, the complexity of
the algorithm is $O(C \cdot 2^n)$, where n is the number of dimension attributes
and C is the cardinality of each group-by. C is an upper bound since the Key
algorithm exits at the first duplicate, which is normally detected before
completing the scan of the whole group-by. The supersets of each Observational
key must also possess the unique identification property and hence these
group-bys can also be eliminated by inspection. If a schema of m attributes has
been recognized as an Observational key of an n-dimension relation R, then the
number of times the schema's m attributes will appear in the $2^n$ possible
aggregations is $2^{n-m}$. The maximum benefit which can be derived occurs when
m = 1 and the least benefit when m = n and thus there is no key (n is the
superset in R). For example, a data cube of ten dimensions would produce 1,024
aggregates. A key of two dimensions would reject $2^{10-2} = 2^8 = 256$
aggregates as Totally-Redundant and thereby no computation or storage would be
required for them. The performance of the Key algorithm in our experiments
required approximately 10% of the conventional data cube processing time.
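
The count of pruned aggregates can be verified by brute force. The sketch below is illustrative only: the function name `supersets_of_key` and the choice of dimensions 0 and 1 as the key are assumptions. It enumerates all group-by schemas of a ten-dimension cube and counts those that contain a given two-dimension key.

```python
from itertools import combinations


def supersets_of_key(n, key):
    """Count the group-bys over n dimensions whose schema contains key."""
    key_set = set(key)
    count = 0
    for arity in range(n + 1):
        for schema in combinations(range(n), arity):
            if key_set <= set(schema):
                count += 1
    return count


n = 10
key = (0, 1)  # hypothetical two-dimension Observational key
pruned = supersets_of_key(n, key)
total = 2 ** n
```

The enumeration agrees with the closed form: each of the remaining n - m dimensions is either present or absent in a superset of the key, which gives 2^(n-m) schemas.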

The Key algorithm

Input:  search lattice of the input Relation R
Output: Set of Observational Keys K - array of strings

begin
  i := 0;
  s := 0;
  K := null;
  while i < NoOfCombinations - 1 do  /* NoOfCombinations = 2^n */
  begin
    if GroupBy[i].size >= R.size then  /* The size is an upper bound */
    begin
      if GroupBy[i].schema includes a key in K then
        skip  /* This is a redundant GroupBy */
      else if found duplicate then  /* First Duplicate Detector */
        skip  /* This GroupBy is an aggregated relation - its schema is not a Key */
      else
      begin
        s := s + 1;
        add the GroupBy schema to K(s)
      end;
    end;
    i := i + 1;
  end;
  return set K;
end;

^ The candidate key can be determined either by definition (database catalogs) or
using the [LO78] algorithm to find keys for a given set of attributes' names and a
given set of functional dependencies. However, this approach would require
knowledge of the functional dependencies.

The First Duplicate Detector. To examine whether a specific set of domains is
an Observational key, the First Duplicate Detector (FDD) is used. This is a simplified
hash-based aggregation which exits with true when a duplicate is found and with false
if none is detected. In the worst case the cost of the algorithm is proportional to the
cardinality of the relation. In practice, however, the average cost is much smaller
since a duplicate tuple is usually found (if it exists) before the full scan of the relation.
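
As an illustration of how the FDD and the Key algorithm fit together, here is a minimal Python sketch. It is not the authors' code: rows are assumed to be tuples whose first `n_dims` positions are dimensions, the size bound is assumed to be the product of the per-dimension distinct counts, and the function names are hypothetical.

```python
from itertools import combinations


def first_duplicate_detector(rows, dims):
    """Return True as soon as two rows agree on every column in dims."""
    seen = set()
    for row in rows:
        group = tuple(row[d] for d in dims)
        if group in seen:
            return True  # early exit at the first duplicate
        seen.add(group)
    return False


def find_observational_keys(rows, n_dims):
    """Scan schemas from minimum to maximum arity and collect the keys found."""
    # Distinct-value count per dimension, used for the size upper bound.
    cardinality = [len({row[d] for row in rows}) for d in range(n_dims)]
    keys = []
    for arity in range(1, n_dims + 1):
        for schema in combinations(range(n_dims), arity):
            if any(set(k) <= set(schema) for k in keys):
                continue  # contains a known key: Totally-Redundant
            bound = 1
            for d in schema:
                bound *= cardinality[d]
            if bound < len(rows):
                continue  # too few possible groups to identify every row
            if not first_duplicate_detector(rows, schema):
                keys.append(schema)
    return keys


# Dimension columns of relation R from Figure 2 (Product, Location, Time).
rows = [
    ("P1", "L1", "T1"), ("P2", "L3", "T4"), ("P1", "L2", "T3"),
    ("P4", "L1", "T1"), ("P3", "L1", "T3"), ("P4", "L3", "T2"),
    ("P1", "L3", "T1"), ("P3", "L2", "T3"), ("P2", "L1", "T2"),
]
keys = find_observational_keys(rows, 3)
```

On this data the single-dimension schemas are rejected by the size bound alone, (Product, Location) passes the FDD, and the three-dimension schema is skipped because it contains the key already found.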

5.2 The Recursive Key Algorithm - Derivative Relations 

The Key-algorithm, as described in Section 5, identifies relations that are
g-equivalent to the input base relation. Further redundancy can be eliminated by
applying the
Key-algorithm recursively to the derivative aggregate views. The experiments 
outlined in Section 7.2 indicate that a further reduction in storage of up to 60% 
can be achieved when the derivatives are utilised. Overall, Totally-Redundant 
views effect storage savings of up to 85% (TPC-D 60K with 10 dimensions) of 
the volume. 



5.3 Implications of Totally-Redundant views 

Potentially the Totally-Redundant views approach can be utilised to reduce the 
workload in selection, computation, storage, indexing and maintenance of the 
materialised views in a data warehouse environment. Indexing is not required 
for the g-equivalent views since they can utilise the index of their input parent re- 
lation. Maintenance is also faster if no duplication with regard to the g-equivalent 
view is inserted. In this case, no update of this view is needed. Through its iden- 
tification of redundant views, the Totally-Redundant views approach reduces the 
browsing time in the interface environment, since views with the same content 
do not require storage or viewing. 

6 Stage 2: Fast Computation of the Data Cube

Although the main algorithm to compute the data cube is based on [GBLP96],
our improved implementation utilises the T-R views to avoid computation and
storage of aggregates already computed and stored. The improved data cube
algorithm is implemented as follows:

Input:  search lattice of the input Relation R
        Set of Keys K, array of Keys
Output: Set of computed aggregates as views V

begin
  i := 0; V := null;
  while i < NoOfCombinations - 1 do  /* From maximum to minimum arity,
                                        where NoOfCombinations = 2^n */
  begin
    redundant := false;
    for j := 0 to entries in K - 1 do
      if K(j) is a subset of GroupBy[i].schema then  /* Totally-Redundant view */
        redundant := true;
    if not redundant then  /* not a Totally-Redundant view - candidate for computation */
    begin
      Aggregate(GroupBy[i]);
      add the GroupBy schema to V[i]
    end;
    i := i + 1;
  end;
  return set V;  /* the set of computed views */
end;
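
The Stage 2 loop can be sketched in Python as follows. This is a simplified illustration under stated assumptions: every view is aggregated directly from the base relation with SUM rather than from its smallest parent as in [GBLP96], and `compute_cube` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations


def compute_cube(rows, n_dims, keys):
    """Materialise only the group-bys that contain no Observational key."""
    views = {}
    for arity in range(n_dims, -1, -1):  # maximum to minimum arity
        for schema in combinations(range(n_dims), arity):
            if any(set(k) <= set(schema) for k in keys):
                continue  # Totally-Redundant view: neither computed nor stored
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in schema)] += row[-1]
            views[schema] = dict(totals)
    return views


# Relation R from Figure 2 with the Observational key (Product, Location).
R = [
    ("P1", "L1", "T1", 80), ("P2", "L3", "T4", 20), ("P1", "L2", "T3", 50),
    ("P4", "L1", "T1", 30), ("P3", "L1", "T3", 80), ("P4", "L3", "T2", 100),
    ("P1", "L3", "T1", 45), ("P3", "L2", "T3", 70), ("P2", "L1", "T2", 30),
]
views = compute_cube(R, 3, keys=[(0, 1)])
```

Of the eight possible group-bys, the two that contain the key are skipped, so six views are materialised; the skipped ones can be answered by a simple projection of R.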



7 Experimental Evaluation 

7.1 Experimental results in Time and Space 

We have evaluated our technique by implementing the algorithms described in
Sections 5 and 6. We compared our approach in time and space with the
conventional approach, utilising the smallest-parent method as described by
[GBLP96]. The proposed algorithms were run on a Dual Pentium 200 MHz with 128 MB
of RAM under Windows NT and with 1 GB of virtual memory. No attempt was
made to utilize the second processor. Six datasets were used, three of which
were from the TPC-D [TPC98] benchmark dataset. We have also used two real
datasets, the weather data [HWL94] and the adult dataset [Koh96], and a small
synthetic dataset [Kim96]. From the TPC-D dataset, the lineitem table was chosen
at three scale factors: 0.1 (600K tuples), 0.01 (60K tuples) and 0.001 (6K
tuples). In the TPC-D datasets the measure of interest is the fifth attribute. In
all experiments the number of dimensions increases in the order the dimensions
appear in the dataset (i.e. where 3 is the selection of the first three dimensions).

The TPC-D 6K dataset. The 6K dataset's attributes' cardinalities were: (1,500),
(200), (10), (7), (50), (3), (2), (2,249), (2,234), (2,241), (4). Figure 3(a) shows the
timing results taken from the TPC-D 6K dataset. Although the T-R views approach is
slower up to the five-dimension data cube, from the six-dimension data cube onwards
it is faster than the conventional data cube. Figure 3(b) shows the results in space
savings, where our approach uses 3.8 times less space than the conventional approach.

The TPC-D 60K dataset. For the 60K dataset, the attributes' cardinalities were:
(15,000), (2,000), (100), (7), (50), (35,967), (2,520), (2,464), (2,531), (4), (7).
Figure 4(a) presents the time required to compute the data cube by both the T-R views
approach and the conventional approach. The T-R views approach becomes faster after
the five-dimension data cube and remains faster in higher dimensions. Figure 4(b)
shows the space savings of our approach, which uses 3.6 times less space than the
conventional approach.

The TPC-D 600K dataset. The 600K dataset was generated with the scale factor
sf=0.1 and cardinalities: (150,000), (20,000), (1,000), (7), (50), (2,526), (2,466),
(2,548). For the 600K dataset, the results are similar to those for the smaller TPC-D
datasets. Figure 5(a) shows the timings for our approach and the conventional
approach. After the fifth dimension our approach remains faster than the conventional
approach. Figure 5(b) shows that our approach is approximately 2 times more
economical in space than the conventional approach.

The Hotel dataset. This dataset is taken from a business example [Kim96]. It
is a synthetic dataset with a fact table of Hotel Stays schema with eight dimensions and
three measure attributes. The size of the dataset is 2,249 tuples and it was selected to
ensure that our performance results in time are not affected by thrashing. The dataset
and all its derivative aggregates, even in the conventional approach, almost fitted into
the main memory. The cardinalities of the dataset were: (183), (26), (100), (20), (20),
(2,168), (2), (20), (4), (907), (366) and (10). With regard to time, Figure 6(a) shows
that beyond the fifth dimension the performance of the T-R approach increases.
Figure 6(b) shows that the g-equivalent representation is 7.6 times smaller than the
conventional one, as a result of the high number of Observational keys identified by
the Key algorithm.

The Cloud dataset. We used the real weather data on cloud coverage [HWL94]. The
dataset contains (116,635) tuples with 20 dimensions. The following
10 cardinalities were chosen: (612), (2), (1,425), (3,599), (5), (1), (101), (9), (24). The
measure attribute was the sixth dimension. The time results in Figure 7(a) show that
the T-R approach is not faster than the conventional approach. The reason for this is
the high skewness of the dataset, which resulted in the non-existence of Observational
keys in the input base relation. The results in Figure 7(b) show that the space
used by our approach is 1.13 times less than that of the conventional approach. This
is lower than in the previous cases using TPC-D datasets. However, Observational keys
have been found in the derivative relations of this dataset, and the savings have been
shown to be much greater than 1.13 (see Figure 11).

The Adult Database. This is a real dataset from the US Census Bureau in 1994
which was first presented by [Koh96]. Six of the attributes are numerical and the
remaining six are categorical attributes. The downloaded version was projected to a
relation with nine attributes, with the 9th attribute defined as the measured attribute.
The cardinalities of the new dataset were: (72), (10), (19,988), (16), (16), (16), (7),
(16), (7), (1). The last attribute was used for measure of interest. Figure 8(a) compares
the timings from both the T-R and the conventional approaches. The T-R approach is
60% slower than the conventional approach. The skewness of the data distribution
causes the non-existence of Observational keys in the base relation. This is confirmed
by the space savings results shown in Figure 8(b), which are not as great as those in
the synthetic datasets. In the next section, we show that savings increase when the
Observational key is searched for in the derivative relations (see Figure 11(b)).

7.2 Totally-Redundant Views of Derivative Relations 

T-R views can be found in aggregates other than the base relation. The experiments
conducted compare the savings in space when the base relation is used as a reference to
the savings achieved when the derivative relations are used as references. On average,
for a base relation with ten dimensions, when utilising all derivative relations, the total
savings increase to 90% of the total views in the data cube. This is in
contrast to an average of 74% of total views achieved when the base relation was used.
For each of the six datasets, the tests were run in several dimensions with the number
varying from three to twelve. The savings in space and time after the elimination of the
T-R views are greater for relations with high dimensionality.

Figure 9(a) compares the volume of the T-R views, for the 600K dataset, when they are
g-equivalent to the base relation with the volume of the T-R views when they are
g-equivalent to the derivatives. For the 600K dataset in seven dimensions, the results
show a further redundancy of 23.5% compared to that found in the base relation.
Figure 9(b) illustrates the results of experiments using the TPC-D 60K dataset. This
shows that the redundancy found when the derivative aggregates have been searched is
approximately 15% more than the redundancy found by utilising only the base relation.
Figure 10(a) compares the same quantities for the 6K dataset. The redundancy found in
this dataset is approximately 21% more than that found by using the base relation as a
reference for the g-equivalence property. For the hotel dataset, the redundancy in ten
dimensions was 13.3% more for the derivative relations method than for the base
relation. These results are illustrated in Figure 10(b).

The Key-algorithm, when applied to the derivative relations, is particularly effective
in the real datasets. Results for the weather dataset reveal that redundancy is 48.41%
more than that found by the simple Key-algorithm. The simple Key-algorithm had
identified only 10.26% of the data cube redundancy, compared to tests run in the
derivative relations, in which the redundancy was 59.93%. Figure 11(a) shows the
results of six different data cube trials for the weather dataset. Figure 11(b) shows
results for the adult dataset. For the simple Key-algorithm, redundancy of T-R views
(based on the base relation) was 1.88%, in contrast to the redundancy identified in
derivative relations of 65.28%.

Savings in storage and time for the T-R approach vary according to the distribution
and sparsity of the trial datasets. Experiments in real datasets, utilising the simple
Key algorithm, show that the performance of the T-R approach was negatively affected
by the skewness of the datasets. The aim of the experiments in this section was to
identify the redundancy in derivative aggregates and show that applying the
Key-algorithm recursively can lead to substantial increases in space savings. The
performance in time is strongly related to the savings in space, since eliminating
redundant views results in less computational effort. The experiments conducted in
this section revealed that the performance of T-R views is affected less by the data
skewness when the derivative relations are utilised as parents.

8 Conclusion

We have shown that the identification of redundancy in the data representation 
of multidimensional aggregates is an important element in the efficient imple- 
mentation of the data cube. The cost of implementing the whole data cube, when 
the Totally-Redundant views approach is utilised, is the cost of computing and 
storing the non-redundant views. Thus, the larger the percentage of redundant 
views in the data cube, the faster the computation compared to the conventional 
approach. After the identification of non-redundant views, the whole data cube 
is materialised without additional cost. We propose extensions to relational the- 
ory to identify by inspection those views which are g-equivalent to their parent 
input relation. The proposed algorithms were evaluated experimentally.
The experiments demonstrate the significance of our approach with regard
to the reduction of storage costs without a significant compromise in time. The 
data skewness negatively affects the existence of Observational keys in the base 
relation. However, when the derivative relations are used as parent relations, 
the data skewness does not affect the existence of keys, and the savings increase 
significantly. Potentially savings in space mean that a smaller number of views 
are materialised. This results in less computational effort. Furthermore, we dis- 
cussed the implications of our approach in indexing, maintenance and the user 
interface. In future work, we will examine the effects of the redundancy in the 
above areas more thoroughly. 

Abstract: Data warehouses typically store a multidimensional fact representation of the
data that can be used in any type of analysis. Many applications materialize data cubes 
as multidimensional arrays for fast, direct and random access to values. Those data 
cubes are used for exploration, with operations such as roll-up, drill-down, slice and 
dice. The data cubes can become very large, increasing the amount of I/O significantly 
due to the need to retrieve a large number of cube chunks. The large cost associated 
with I/O leads to degraded performance. The data cube can be compressed, but 
traditional compression techniques do not render it queriable, as they compress and 
decompress reasonably large blocks and have large costs associated with the 
decompression and indirect access. For this reason they are mostly used for archiving. 
This paper uses the QuantiCubes compression strategy that replaces the data cube by a 
smaller representation while maintaining full queriability and random access to values. 
The results show that the technique produces large improvement in performance. 



1. Introduction 

The amount of information stored and analyzed in data warehouses can be very large. 
Data cubes are frequently used to materialize views of data sets for Online Analytical 
Processing (OLAP). They can be very large and must be stored and retrieved from 
disk in costly I/O operations. 

The multidimensional data cube representation uses fixed-size cells and has the 
advantages of full queriability and random access, as the position of any value is 
directly computable from the offset on each dimension axis. The major drawback of 
this representation is that it can occupy a lot of space, causing large access overheads. 
This has led us to investigate compressed representations of normal data cubes that
would reduce the number of I/O operations needed without obstructing queriability 
and random access. 

Compressing the data cube has important advantages. The most obvious one is that it 
saves space (between 50% and 80% when QuantiCubes is used). However, the largest 
advantages are consequences of the smaller space usage. Whenever the memory is 
scarce compared to the size of the data sets composing the workload, those savings 
can mean less disk swapping, as more data fits in memory buffers and cubic caches. 
An even more important effect is that the most time consuming queries over data cube 
chunks stored on disk become almost 2 to 5 times faster simply because only one half
to one fifth of the data has to be loaded from disk (chunk storage organization of data 
cubes on disk is discussed in [6]). 

Traditional compression techniques are not useful in this context. They
achieve large compression rates that are very useful for archiving (e.g. Arithmetic
Coding, Lempel-Ziv, Huffman Coding, ...) [2, 5, 7, 8]. However, the compressed data
sets are not directly queriable without prior decompression of whole blocks, due to
the variable length of the compressed fields. The term “variable-length field” refers
to the usage of a variable number of bits to represent compressed values.
Compression/decompression is done by blocks and requires expensive access methods
to locate the correct block. This indirect access feature accounts for a large overhead.
The decompression overhead of most typical techniques is also significant, as it
includes bit processing operations and access to auxiliary structures such as
dictionaries or code trees to decode the bitstream.

In this paper we present the QuantiCubes technique, which packs reduced 
fixed-width representations of the individual cells to compress the data cube as much 
as possible for faster analysis, while maintaining the full queriability and random 
access properties of the uncompressed data cube. These results are only possible 
because QuantiCubes guarantees insignificant decompression overhead and maintains 
the same degree of random access and queriability of the uncompressed data cube, 
because it stores fixed-width compressed values. The technique is presented and 
tested in this paper. 

The paper is organized as follows: section 2 presents the QuantiCubes 
technique. Section 3 presents experimental results and section 4 concludes the paper. 



2. The QuantiCubes Technique 



QuantiCubes transforms the normal data cube representation. The data is 
analyzed first to determine a fixed-width compressed coding that is optimal with 
respect to a set of error and space criteria. This is achieved using an adaptable 
quantization procedure. The data cube values are then coded and packed to achieve a 
very large compression rate. For example, a data cube cell occupying 4 bytes is 
compressed to the size of the bitcode that was determined for that cell (e.g. 9 bits). 
Finally, extremely fast decompression of individual values “on-the-fly” is achieved by
unpacking and dequantizing the values with very fast, low-level operations. 



2.1. Data Analysis 

Quantization [4] is used in QuantiCubes to determine the set of representative 
values for an attribute. The crucial issue of accuracy is guaranteed by our new 
proposal for an “adaptable” quantization procedure. This adaptable procedure uses an
accuracy constraint based on user-defined or preset error bounds during the
determination of quantization levels. This way, the number of levels that are 
necessary is determined adaptively and attributes that cannot be estimated with large 
accuracy using a reasonable number of quantization levels can be left uncompressed. 
The new adaptable quantization algorithm is not discussed here for lack of space. A 
complete discussion of the algorithm is available in [1]. 

The specific OLAP context is important in the adaptable quantization 
procedure, because many fact attributes measuring quantities such as daily or periodic 
sales offer favorable cases for the replacement by a number of representative values, 
as they are often distributed around clusters and repeated values. The best results 
(most accurate and with fewer bits) are achieved when the data set is made of a small
set of distinct values that are repeated all the time. In that case the quantization levels
coincide with distinct values and the bitcodes are simply a more compact
representation of the values. In the extreme case, an attribute could have a small
number of distinct values in a very large number of rows (e.g. 10M rows).
Quantization also returns very good approximations when a significant fraction of the 
values are distributed around sets of clusters, in which case reconstruction values 
converge to the centers of distinct clusters. As more levels are added, the quantizer 
finds the most appropriate positions for the quantization levels to represent most 
values accurately. 

Both the analysis and experimental results in [1] show that the approach 
guarantees accuracy and typically provides almost exact value reconstruction and 
precise query answers. This is an important issue, as the technique maintains the 
queriability of uncompressed data but is also able to guarantee high levels of accuracy 
for every query pattern and even in the reconstruction of individual values. 



2.2. Reconstruction Array 

As with many other compression techniques, an auxiliary structure must be 
accessed to convert between values and bitcodes. However, a small reconstruction 
array is used and a single direct access to a cell is required for decompression. Figure 
1 shows an example that illustrates a reconstruction list. It is a simple array ordered 
by the reconstruction values with typical sizes between 64 and 8192 values, 
corresponding to bitcodes with 6 to 13 bits. In this list, bitcodes coincide with the 
index of the cell. The array is accessed to retrieve the bitcode (index in the array) or 
the reconstructed value (the value in each position of the array). 



| 3.234 | 9.21 | 14.764 | 21.37 | ... | 102.54 |

Figure 1 - Illustration of a Reconstruction List

During compression, the reconstruction array is accessed to translate values into the 
bitcodes of the reconstruction values that are nearest to those values (the bitcodes 
coincide with the indices, as mentioned above). During decompression, the bitcodes 
are translated back to the corresponding reconstruction values. 



2.3. Compression 

Compression requires an access to the reconstruction array to retrieve the bitcode 
that replaces the value and a concatenation of that value into a word. Each word 
contains a small number of packed bitcodes and is sized to minimize the overhead of 
decompression [1]. 

The lookup in the reconstruction array requires a binary search of the array to
determine the quantization level. The search must find the reconstruction value that is
closest to the value that must be compressed and return the corresponding index. This
bitcode is then packed (concatenated) into the word using two low-level operations
(shift = <<, OR = |, AND = &): bitcode_i = ((data_register << shift_i) | mask_register_i);
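
The lookup and packing steps can be sketched as follows. This is an illustrative Python sketch rather than the QuantiCubes implementation: the function names `nearest_level` and `pack`, the 3-bit code width, and the five-entry array (the values visible in Figure 1, with the elided entries left out) are all assumptions.

```python
from bisect import bisect_left


def nearest_level(value, levels):
    """Binary-search the sorted levels; return the index of the closest one."""
    pos = bisect_left(levels, value)
    if pos == 0:
        return 0
    if pos == len(levels):
        return len(levels) - 1
    # Pick whichever neighbour is closer; ties go to the lower level.
    if levels[pos] - value < value - levels[pos - 1]:
        return pos
    return pos - 1


def pack(bitcodes, width):
    """Concatenate fixed-width bitcodes into one integer word (shift and OR)."""
    word = 0
    for i, code in enumerate(bitcodes):
        word |= (code & ((1 << width) - 1)) << (i * width)
    return word


levels = [3.234, 9.21, 14.764, 21.37, 102.54]  # values visible in Figure 1
codes = [nearest_level(v, levels) for v in (3.0, 10.0, 100.0)]
word = pack(codes, width=3)
```

Because every bitcode occupies the same number of bits, the position of the i-th code inside the word is simply i times the width, which is what keeps random access possible after compression.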



2.4. Decompression 

The most crucial operation is decompression, as it is used for “on-the-fly” 
reconstruction of values during the processing of queries. It requires unpacking the 
bitcodes and a direct access to the reconstruction array, which is conveniently small to 
fit in the microprocessor internal cache after a few initial misses. Unpacking uses the 
operation: bitcode_i = ((data_register >> shift_i) & mask_register_i);

The reconstruction value is retrieved using a direct access to the reconstruction list: 
Reconstructed_value = Reconstruction_List[bitcode]; 

With this strategy, the unpacking operations are extremely fast. 
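
A minimal sketch of the unpack-and-lookup step, under the same assumptions as the operations above: the function name `unpack`, the 3-bit width, the example word and the five-entry list (the values visible in Figure 1) are illustrative choices, not taken from the paper.

```python
def unpack(word, i, width):
    """Extract the i-th fixed-width bitcode from a packed word (shift and AND)."""
    mask = (1 << width) - 1
    return (word >> (i * width)) & mask


reconstruction_list = [3.234, 9.21, 14.764, 21.37, 102.54]

# A word holding three 3-bit codes: 0 in bits 0-2, 1 in bits 3-5, 4 in bits 6-8.
word = 0b100_001_000
values = [reconstruction_list[unpack(word, i, 3)] for i in range(3)]
```

Each value costs one shift, one mask and one array access, and no neighbouring value has to be decoded first.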



3. Analysis and Evaluation of the Technique 

We have measured a 2 to 5 times improvement in access to the compressed data
cube on disk, due to the much smaller number of accesses that is necessary to load the
compressed cube, while also saving a large amount of space. This is shown in Figure
2, which compares the time taken to load a normal and a compressed data cube, when
the compressed data cube has 1/4 of the size of the original data cube, as shown in the
first column.



Size (MB)
Normal/Cmprss   Cmprss     UnCmprss
100/25          11.837     49.431
200/50          23.434     104.01
400/100         48.54      212.158
600/150         73.814     333.631
800/200         112.99     495.07

Figure 2. Loading Time (secs)
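
The speed-up implied by Figure 2 can be computed directly from the table. The short Python sketch below only restates the figures above; the variable names are illustrative.

```python
# Loading times in seconds from Figure 2, keyed by uncompressed size in MB:
# (compressed cube, uncompressed cube).
loading_times = {
    100: (11.837, 49.431),
    200: (23.434, 104.01),
    400: (48.54, 212.158),
    600: (73.814, 333.631),
    800: (112.99, 495.07),
}

speedups = {
    size: uncompressed / compressed
    for size, (compressed, uncompressed) in loading_times.items()
}

for size, ratio in speedups.items():
    print(f"{size} MB: {ratio:.2f}x faster when compressed")
```

With a compressed cube one quarter of the original size, every row of the table gives a loading speed-up between 4 and 5, slightly above the size ratio.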

In fact, the data cube does not need to be decompressed before operation or
analysis even when in memory, as online decompression of individual values does not
degrade the performance. This property saves a significant amount of memory buffer
space. Assuming that Y values are operated upon in the OLAP query (we use the
example Y = 100 million values with 4 bytes each - 400 MB - and 4 times
compression rate), the following operations are necessary:

Uncompressed data cube:

4Y bytes loaded from memory (100 M values = 400 MB)

+ Y aggregation-related operations (100 M operations)

Compressed data cube:

Y bytes loaded from memory (100 M values = 100 MB)

+ Y unpack operations (100 M ops) + Y lookup operations (100 M ops)

+ Y aggregation-related operations (100 M ops)
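
The counts above can be reproduced with a few lines of arithmetic. This is only a restatement of the example figures (100 million 4-byte values, 4 times compression); the variable names are ours, and the sketch says nothing about how long each kind of operation takes.

```python
Y = 100_000_000          # values touched by the query (example from the text)
BYTES_PER_VALUE = 4      # width of an uncompressed value
COMPRESSION_RATE = 4     # 4 times compression, i.e. 1 byte per value on average

uncompressed_bytes = Y * BYTES_PER_VALUE
compressed_bytes = uncompressed_bytes // COMPRESSION_RATE

uncompressed_ops = Y             # aggregation only
compressed_ops = Y + Y + Y       # unpack + lookup + aggregation

print(uncompressed_bytes // 10**6, "MB vs", compressed_bytes // 10**6, "MB")
print(uncompressed_ops // 10**6, "M ops vs", compressed_ops // 10**6, "M ops")
```

The compressed cube trades three times as many CPU operations for a quarter of the memory traffic, which is the trade-off the timing experiments measure.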

This discussion shows that operation on the compressed cube accesses a
much smaller number of memory words, but the bitcodes must be unpacked and
looked up. Predicting which of these operations is fastest is not trivial. It depends on
the scheduling of instructions, the time taken to execute them and the use of internal
caches by the microprocessor in the lookup functionality. However, by compiling and
executing the code in two different systems, it was possible for us to conclude that the
overhead incurred by the decompression is not larger than the overhead required to
fetch the excess of uncompressed values. Figure 3 shows the time taken to operate on
the compressed and uncompressed data cubes and also on an optimized version of the
uncompressed data cube that uses loop unrolling [3] to try to match the speed of
operation on the compressed data cube. This code was executed on a Pentium II
300 MHz processor.



Size (MB), Normal/Cmprss    Cmprss    UnCmprss    UnCmprss, Unrolled
25/6.25                     0.601     1.352       0.610
50/12.5                     1.132     2.744       1.192
100/25                      2.353     5.498       2.354
150/37.5                    3.455     8.232       3.555
200/50                      4.607     11.026      4.736

Figure 3 - CPU Execution Time in Sum Operation

The adaptable quantization procedure produces an approximate data set. However,
our new proposal for adaptable quantization allows the algorithm to find the
appropriate number of levels that are necessary. Figure 4 shows the relative error
(average, maximum and standard deviation) of the reconstruction of individual values
taken by the attribute “Account Balance” of a bank data set as more bits are added to
the quantizer (each additional bit doubles the number of levels available). This
attribute had irregular distribution. The result shows that the strategy is able to adapt
the number of levels until the error constraints are met. In this case, the quantity was
well estimated using 10 to 12 bits.
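
The idea of adding bits until an error constraint is met can be sketched as follows. The level placement used here (the mean of each equal-count bucket) and the use of the maximum relative error as the stopping criterion are assumptions made for the sake of a runnable example; they are not the quantizer design of the paper, and all function names are hypothetical.

```python
import bisect


def build_levels(sorted_values, n_levels):
    """Assumed level placement: the mean of each equal-count bucket."""
    n = len(sorted_values)
    levels = []
    for k in range(n_levels):
        bucket = sorted_values[k * n // n_levels:(k + 1) * n // n_levels]
        if bucket:
            levels.append(sum(bucket) / len(bucket))
    return sorted(set(levels))


def nearest(levels, v):
    """Return the reconstruction level closest to v."""
    i = bisect.bisect_left(levels, v)
    candidates = levels[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda x: abs(x - v))


def adapt_bits(values, max_rel_error, max_bits=16):
    """Add one bit at a time until the worst relative error is acceptable."""
    data = sorted(values)
    for bits in range(1, max_bits + 1):
        levels = build_levels(data, 2 ** bits)   # each bit doubles the levels
        worst = max(
            (abs(nearest(levels, v) - v) / abs(v) for v in data if v != 0),
            default=0.0,
        )
        if worst <= max_rel_error:
            return bits, levels
    return max_bits, levels


balances = [1, 2, 4, 8, 16, 32, 64, 128]
print(adapt_bits(balances, max_rel_error=0.5)[0])   # 2
print(adapt_bits(balances, max_rel_error=0.1)[0])   # 3
```

A tighter error bound forces more bits, mirroring the behaviour reported for the Account Balance attribute.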




Figure 4 - Adaptable Quantization of Attribute Account_Balance

Our experimental evaluation included other experiments that are not shown for
lack of space. A more complete discussion and set of experiments is included in [1]. 
Those results include the comparison of decompression overhead with other 
compression techniques, showing that QuantiCubes is at least 10 times faster. The 
compression overhead was also tested experimentally. Although compression is 
slower than decompression, we have determined experimentally that the time taken to 
compress and store the data with QuantiCubes is not larger than the time taken to 
store the uncompressed data, because the compressed data cube is much smaller. 



4. Conclusions and Future Work

In this paper we have proposed a technique (QuantiCubes) which is able to 
compress data cubes 2 to 5 times and still maintain complete queriability and random 
access. We have shown that the decompression overhead is completely insignificant, 
so that “on-the-fly” decompression of any number of values is possible in any 
situation without significant degradation of performance. The technique is also able to 
deliver 2 to 5 times faster operation than the uncompressed data. The fast operation of 
the technique is analyzed in detail and confirmed using experimental results. The 
technique is used in the context of analysis and operation on the data cube. It is not 
intended to replace data compression techniques in their archiving or longer-term disk 
storage function, as they achieve very large compression rates using a variable-width, 
exact representation of the data set.



Abstract. Views over distributed information sources rely on the stability
of the schemas of these underlying sources. In the event of meta
data changes in the sources, such as the deletion of a table or column, 
such views may become undefined. Using meta data about information 
redundancy and user preferences, views can be evolved as necessary to 
be defined on a modified information space after a source meta data 
change, while assuring the best possible compliance to preferences that 
users may have. 

Previous work in view synchronization focused only on deletions of schema 
elements. We now offer the first approach that makes use of additions 
also. The algorithm is based partly on returning view definitions to 
previous versions by “backtracking” in the history of views and meta 
data. This technology enables us to adapt views to temporary meta 
data changes by cancelling out opposite changes and allows undo/redo- 
operations on meta data without deteriorating the quality of the view. 
The mechanism described in this paper will therefore improve the quality 
of evolved views. 



1 Introduction 

In recent years, the number of digital information storage and retrieval systems 
has increased immensely. Many of those database systems are available in some 
kind of network, and the task of integrating data from different source databases 
is an increasingly important one. Solutions to this problem include data ware- 
houses, a term which in the database community refers to materialized views 
over distributed information sources. Newer developments, such as the ubiquity 
of the World Wide Web, emphasize the independence of data producers from 
data consumers. Independent data producers or providers have control over the 
schema of their information sources which raises the question of the influence of 
meta data updates (deletions of attributes or relations in underlying databases)
on a view [RKZ00]. In traditional views (as introduced in the literature handling
data updates, e.g., [BLT86, AASY97]), meta data updates can render a
view definition undefined.

Some work has been done in the EVE-Project [RLN97, NLR98, KRH98] 
that deals with the rewriting of views under schema updates in the underlying 
sources in the case of schema element deletions, retaining all or a part of the 
view extent in the new rewritten view. We described algorithms allowing gen- 
erated view rewritings to be non-equivalent to the original query [LNR97], and 
presented a model for a numeric assessment of the quality and cost of such rewrit- 
ings [LKNR99], allowing for the automatic selection of the best view rewriting 
among many generated ones. 

One drawback of the previous work mentioned is the finality of a view syn- 
chronization operation. After the deletion of an underlying relation that has been 
used by a view, the view is rewritten to not refer to that relation any more. Even 
if the same relation is later added back to the information space (for instance, 
after a temporary unavailability due to a network problem), it will never be used
by the view again since without a global domain model the view synchronization 
algorithm cannot determine in what relationship a new data element stands to 
other previously available elements. 

Note that with all view synchronization algorithms that have been devised so 
far, only one-step view synchronization is possible (cf. Section 3). The existing al- 
gorithms POC/SPOC [NR98], SVS [RLN97], CVS [NLR98], and GRASP [NR98] 
cannot synchronize a view under a sequence of schema changes. Rather, they gen- 
erate a new view definition after each schema change in an information source, 
without taking any information about earlier IS changes into account. Also, they 
do not support any meaningful synchronization under add-capability-changes.
This work describes a way to overcome this shortcoming of previous approaches. 

In this paper, we present a complete approach capable of handling all infor- 
mation source schema changes, namely adds, renames, and deletes of attributes 
and relations. The main contribution of this paper is the utilization of additional 
available meta data to keep views as close to their original definition as possible, 
under a sequence of meta data changes that occurs over time. 

2 Related Work 

Materialized views over distributed information sources have been explored for 
a number of years. First work focused on questions of materialized view mainte- 
nance under data updates in the sources [BLT86, GM95]. Work on maintenance 
of the more recent mediator-based (constrained) heterogeneous database sys- 
tems was done by Lu et al. [LMSS95]. More recently, questions of optimizing 
view queries given varying parameters or capabilities of underlying sources have 
also been explored [AASY97, GMHI+95]. Generally, work in this area assumes 
that the rewritten view query computes a view extent equivalent to the origi- 
nal one. Prominent approaches that deal with equivalent query rewriting include 
work by Jarke et al. [JK84] or Agrawal et al. [AASY97]. Important work on integrating
heterogeneous sources in one view using a common semistructured data
model (OEM) has been done in the TSIMMIS-project [GMHI+95]. Chaudhuri
and Shim [CS96] present work on query optimization under user-defined SQL 
predicates (e.g., stored procedures), which deals with finding optimized query 
plans for such queries but does not consider changing or non-equivalent queries. 
Finkelstein [Fin82] describes a model to achieve query optimization in an envi- 
ronment in which a sequence of related (similar) queries is executed on a query 
engine (using query-graphs).

Work on rewriting queries using views (e.g., [LMSS93]) is used in subsequent 
work by Levy et al. which is closely related to ours in terms of its goal, but not 
the approach taken. In [LSK95], the notion of the world-view is introduced as a 
global, fixed domain model of a certain part of the world on which both information
providers and consumers must define views. This work is in some sense
an approach inverse to ours [RLN97] but relies on the existence of a fixed, invariant
“world model” which is unlikely to be available in practice. Furthermore,
changes to the world model are not possible in this approach or would require a 
manual redefinition of both information providers’ and consumers’ queries. 

Related approaches to Levy’s are Arens et al. [AKS96] and the SoftBot 
project [EW94]. The SoftBot project has a different approach to query processing 
as they assume that the system has to discover the “links” among data sources 
that are described by action schemas. While related to our view synchronization 
algorithm CVS [NLR98], the SoftBot planning process also relies on discovering 
connections among information sources when very different source description 
languages are used. Neither SIMS nor SoftBot address the particular problem of 
evolution under capability changes of participating information sources. 

3 Background: The EVE-System 

Since our work is based on the EVE-Project [RLN97], we give a short overview 
over those elements of the EVE-System that we need as background in this paper. 
The algorithm described in this paper builds on top of the one-step synchroniza- 
tion algorithms as described in the EVE-Project [RLN97, NLR98, KRH98]. As 
we will see, the new algorithm in some cases falls back on these old algorithms 
and performs one or a sequence of single-step view synchronizations. 

E-SQL or Evolvable-SQL is an extension of SQL that allows the view definer 
to express preferences for view evolution [LNR97]. A user defining a view can 
specify what information is dispensable and what information is replaceable by 
similar information from other information sources (ISs), and whether a changing
view extent is acceptable. Those user preferences are expressed as parameters
that can be attached to basic SQL query elements. For example, the qualifier
AD declares an attribute dispensable, RR means a replaceable relation, and VE
(which is one of ⊆, ⊇, ≡, ≈) expresses the relationship between the original and
rewritten query as required by the view definer. The default for VE is ≡, for all
other parameters it is false.

Example 1. A typical E-SQL query would be:

CREATE VIEW Asia-Customer (VE = “⊇”) AS
SELECT Name, Address, Phone (AD = true, AR = true)
FROM Customer C (RR = true), FlightRes F
WHERE (C.Name = F.PName) AND (F.Dest = ‘Asia’) (CD = true)

View definitions are stored in a View Knowledge Base (VKB). Once a view is 
defined, EVE view synchronization algorithms are invoked after schema changes 
in all ISs participating in this view and attempt to find replacements for missing 
view elements. For the purpose of identifying view element replacements from 
other ISs, we express relationships between ISs through constraints (e.g., agree- 
ing data types, functional dependencies between attributes, and extent overlaps 
between relations, all of which are stored in the Meta Knowledge Base (MKB)). 
One important constraint is the containment constraint between two relations,
stating that a (horizontal and/or vertical) fragment of a relation is semantically
contained in or equivalent to a (horizontal and/or vertical) fragment of another.
Examples of view synchronization can be found in the technical report to this
paper [KR00].

Since view synchronization algorithms may generate many possible query
rewritings, one needs to be selected as the new view definition. For this purpose
we have developed the QC-Model [LKNR99] to estimate the quality and cost
of the rewritings. Each legal query rewriting will in general preserve a different
amount (extent) and different types (interface) of information, which we measure
through a metric that we have defined on views and refer to as the quality of
the view. Also, each new view query will cause different view maintenance costs,
since in general data will have to be collected from a different set of ISs. With
these two dimensions, the QC-Model can assign a numerical QC-Value to each
query rewriting and thus compare different rewritings with each other.



4 Modeling Meta Data and Meta Data Changes 

4.1 Meta Data Model 

In the EVE-System as introduced by [RLN97], View and Meta Knowledge Bases 
(VKB and MKB) exist to maintain data warehouses over changing information 
sources. MKB data is encoded in the form of schema definitions and constraints. 
In this sense, we will describe schemas of underlying information sources by 
Attribute Constraints and Relation Constraints, and for the purpose of this 
paper, we will concentrate on a total of four different constraints that comprise 
our MKB: attribute constraints (AC) define attribute names, relation constraints
(RC) define relation names, containment constraints (CC) define relationships
between relations, and containment pair constraints (CPC) relate attributes in
those relations to each other (i.e., define projections of relations that are in a set
relationship to each other). We also introduce two functions that relate attributes
with their relations (rel: AC → RC) and CPCs with their CCs (con: CPC → CC).
For details, refer to our technical report [KR00].



4.2 Changes in Meta Data

Changes to the schemas of sources which are modeled in the Meta Knowledge 
Base cause MKB and VKB changes. This paper describes how to synchronize 
view definitions with such meta data changes under consideration of previous 
meta data changes in the information space, and how to make use of added meta 
knowledge by “re-installing” a previously deleted schema element. In order to 
better specify the meaning of “previously deleted element”, we introduce the 
concept of history of view and meta knowledge to our system, starting with a 
formal definition of schema changes. 

Definition 1 (meta data change). Let C_i be a constraint (AC, RC, CC, or
CPC, cf. Section 4.1). A meta data change MDC_i is the addition or deletion of
C_i to or from the Meta Knowledge Base, transforming the MKB from a previous
state MKB_i to a subsequent state MKB_{i+1}. Valid meta data changes are:
delete-CPC(C_i), delete-CC(C_i), delete-RC(C_i), delete-AC(C_i), add-CPC(C_i),
add-CC(C_i), add-RC(C_i), add-AC(C_i). Each MDC_i has as a parameter exactly
one constraint in the MKB, which is denoted by constraint(MDC_i).

We also introduce the notion of an inverse meta data change operation. 
Two meta data changes are inverse to each other if they are add- and delete-
operations, respectively, and refer to the same constraint (e.g., delete-RC(RC_1) =
(add-RC(RC_1))^{-1}).
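
A meta data change and its inverse can be modeled with a small value type. The class and field names below are hypothetical and serve only to make the definition concrete; constraints are represented by plain string identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaDataChange:
    op: str          # "add" or "delete"
    kind: str        # "AC", "RC", "CC" or "CPC"
    constraint: str  # identifier of the affected constraint

    def inverse(self):
        """Return the change that undoes this one."""
        flipped = "delete" if self.op == "add" else "add"
        return MetaDataChange(flipped, self.kind, self.constraint)


def are_inverse(a, b):
    """True if a and b are an add/delete pair on the same constraint."""
    return a.inverse() == b


delete_rc = MetaDataChange("delete", "RC", "RC1")
add_rc = MetaDataChange("add", "RC", "RC1")
print(are_inverse(delete_rc, add_rc))   # True
```

Making the type immutable and comparable by value lets a later add be matched against an earlier delete by simple equality.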

4.3 Evolution of the MKB under Meta Data Changes 

For the eight operations in Definition 1, the respective constraint is simply added
to or removed from the MKB. In addition to that, some changes imply others.
For example, the meta data change operations delete-RC and delete-CC trigger
deletions of Attribute Constraints and Containment Pair Constraints, respectively.
Those implicit operations are executed on the MKB immediately after
the meta data change is detected, but before the view synchronization algorithm
is run on the view definitions in the VKB. Similarly, we assume (or generate)
subsequent additions of sub-constraints in the same fashion for add-RC
and add-CC-operations. In the case that the set of attributes in a relation has
changed while this relation was unavailable, we assume that appropriate add-AC
or delete-AC operations are generated after the add-RC operation. Note that renames
of relations or attributes cause all RC or AC in all MKB states to be
renamed at once and do not constitute an MDC in the sense of Definition 1.
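
The implied deletions can be sketched as follows, assuming the MKB is held as a dictionary of constraint sets and that the functions rel and con are given as plain dictionaries. The function name and the sample identifiers are hypothetical. Implied additions are not generated here; as stated above, they are assumed to arrive as separate add operations.

```python
def apply_change(mkb, rel, con, op, kind, constraint):
    """Apply one meta data change and return the new MKB state.

    mkb maps "AC", "RC", "CC" and "CPC" to sets of constraint identifiers.
    rel maps each AC to its RC; con maps each CPC to its CC.
    """
    new = {k: set(v) for k, v in mkb.items()}   # keep the old state intact
    if op == "add":
        new[kind].add(constraint)
        return new
    new[kind].discard(constraint)
    if kind == "RC":    # implied delete-AC for the relation's attributes
        new["AC"] -= {a for a in new["AC"] if rel.get(a) == constraint}
    if kind == "CC":    # implied delete-CPC for the containment's pairs
        new["CPC"] -= {p for p in new["CPC"] if con.get(p) == constraint}
    return new


mkb = {
    "AC": {"Customer.Name", "Customer.Phone", "FlightRes.Dest"},
    "RC": {"Customer", "FlightRes"},
    "CC": set(),
    "CPC": set(),
}
rel = {
    "Customer.Name": "Customer",
    "Customer.Phone": "Customer",
    "FlightRes.Dest": "FlightRes",
}
after = apply_change(mkb, rel, {}, "delete", "RC", "Customer")
print(sorted(after["AC"]))   # ['FlightRes.Dest']
```

Returning a fresh state instead of mutating the old one keeps every earlier MKB available, which the history of meta data relies on.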



4.4 Dependency of Views on Meta Data 

In previous view synchronization algorithms [RLN97, NLR98, KRH98] the rewriting
of a view is accomplished with the help of constraints that state (partial)
redundancy in the information space. For instance, a containment constraint
between R and S can be used to replace relation R by relation S in a view if
the former becomes unavailable. The meaning of that constraint is that the
information provider(s) of relations R and S guarantee that the constraint is and
remains valid. Even when relation R becomes unavailable, this condition will
still be kept valid. If this CC-constraint were also deleted from the MKB, it
could not be utilized by the view synchronization process. We define a rewritten
view to be constraint-dependent on a CC-constraint if the latter was used in
the above sense in the rewriting of a view V. A formal definition can be found
in [KR00]. Note that any further rewriting of the view can, but does not have
to, retain the view’s constraint-dependency on any constraint.

4.5 Modeling the History of Meta Data and Views 

We will now define a way to model the history of meta data and views, i.e., the 
sequence of changes in meta data that have occurred on a MKB and VKB. 

Definition 2 (history graph). Let a view V be rewritten by a meta
data change sequence S = {MDC_1, MDC_2, ..., MDC_n}. The history H(V)
of view rewriting is a graph H(V) = (N, E) with N the set of nodes representing
the states of both MKB and VKB at a certain point in time, and E the (labeled
directed) edges representing transitions from one state to the next (meta data
changes). Each node N_i is a pair <MKB_i, VKB_i> and each edge E_i is labeled
with the meta data change MDC_i it represents. Node N_0 contains the original
Meta Knowledge and View Knowledge Bases (i.e., the meta knowledge at view
definition time and the originally defined views) while N_n represents the current
state of the system.
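
One possible in-memory layout for such a history is a list of states plus a list of edge labels. The sketch below is an assumption about representation only; the class names are hypothetical, and constraints and view definitions are simplified to strings.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    mkb: frozenset   # constraints known in this state
    vkb: dict        # view name -> view definition in this state


@dataclass
class HistoryGraph:
    nodes: list = field(default_factory=list)   # N_0 ... N_n
    edges: list = field(default_factory=list)   # edges[i] labels N_i -> N_{i+1}

    def append(self, change, mkb, vkb):
        """Record the state reached after applying `change` (None for N_0)."""
        if self.nodes:
            self.edges.append(change)
        self.nodes.append(Node(frozenset(mkb), dict(vkb)))

    @property
    def current(self):
        """The most recent state, N_n."""
        return self.nodes[-1]


history = HistoryGraph()
history.append(None, {"RC:Customer"}, {"V": "original definition"})
history.append(("delete", "RC:Customer"), set(), {"V": "rewritten definition"})
print(len(history.nodes), len(history.edges))   # 2 1
```

Since every node has at most one successor, the graph degenerates to a chain, so two parallel lists are sufficient.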

5 The History-Driven View Synchronization Algorithm 

View synchronization has to be performed when a view is affected after a schema 
change in the information space. Views whose definitions are still valid after the 
change do not have to be synchronized. A view V is affected by a schema change 
if (1) a constraint on which V was constraint-dependent is added to the MKB or 
(2) a constraint on which V is currently constraint-dependent is removed from 
the MKB. 

The History-Driven View Synchronization Algorithm (HD-VS), which is executed
after a schema change for each affected view, uses three main concepts:
backtracking in the history of a view, re-applying a part of a meta data update
sequence from that history, and reconstructing part of the view’s history graph
in the process of re-application of meta data changes. Figure 1 gives an example
of the algorithm. Four meta data changes are executed on a view V. After
each meta data change, HD-VS checks whether the history of the view contains
information that could be used to rewrite view V in an appropriate way. Two
consecutive delete-constraint operations both lead to backtracking in the history
of the view but do not find an appropriate inverse operation to cancel with. In
both cases, HD-VS falls back on a single-step view-synchronization algorithm
(e.g., POC). The third operation, an add-constraint operation, also leads to
unsuccessful backtracking but in the case of an addition to the information space,
no POC-view synchronization is necessary for the view to remain valid. The
fourth operation, an add-operation inverse to a previous delete-operation, leads
to successful backtracking. Here, the two operations add-X and del-X cancel
out. HD-VS applies a meta data change sequence to V that does not contain these
two operations, but all others. Then, the remaining two operations (del-Y and
add-U in our example) are applied in the above mentioned fashion, but do not
lead to further cancellations, so the final view obtained is V2.




Fig. 1. Example Sequence for the History-Driven View Synchronization Algorithm (HD-VS). 



5.1 Backtracking in the History of a View 

In order to perform view synchronization, HD-VS applies a technique that we
will call backtracking. The purpose of the backtracking process is to find a view
V_i in the history of V from which to re-apply or redo a changed meta
data update sequence. The idea of backtracking now consists in finding a node
in H(V) for which a certain condition C holds.

Definition 3 (backtracking). Backtracking in the history H(V) of a view
rewriting after a meta data change MDC_{n+1} to a constraint C_i is the traversal
of the graph H(V) starting from node N_n through nodes N_{n-1}, N_{n-2}, ...,
until the following condition is satisfied:

1. if MDC_{n+1} is an add-constraint operation: the view in the visited node is constraint-dependent on C_i, or

2. if MDC_{n+1} is a delete-constraint operation: the view in the visited node is not constraint-dependent on C_i.

The backtracking process returns the node N_i with the highest index i (i.e., the
most recent view) satisfying the above condition.
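
The traversal can be sketched in a few lines. The sketch assumes that, for each node, the set of constraints on which its view is constraint-dependent is already known; the function name and the sample constraints X, Y and Z are hypothetical. Returning None when no node qualifies is our simplification; the case where backtracking reaches the beginning of the sequence is treated in the technical report.

```python
def backtrack(dependencies, op, constraint):
    """Return the highest index i whose view satisfies the backtracking condition.

    dependencies[i] is the set of constraints on which the view stored in
    node N_i is constraint-dependent. Returns None if no node qualifies.
    """
    for i in range(len(dependencies) - 1, -1, -1):   # N_n, N_{n-1}, ...
        depends = constraint in dependencies[i]
        if (op == "add" and depends) or (op == "delete" and not depends):
            return i
    return None


# N_0: original view; N_1: rewritten using X; N_2: rewritten using Y.
dependencies = [set(), {"X"}, {"Y"}]

print(backtrack(dependencies, "add", "X"))      # 1
print(backtrack(dependencies, "delete", "Y"))   # 1
print(backtrack(dependencies, "add", "Z"))      # None
```

Scanning from the newest node backwards guarantees that the first hit is the most recent view satisfying the condition.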

The motivation behind this definition of backtracking is as follows: If a constraint
C_i is added to the MKB, we want to determine whether this constraint
has previously been used in the history of view V. In this case, C_i must have been
deleted from the MKB earlier and we can cancel the add with the delete of C_i.
Therefore, we backtrack to a view that used C_i, since that view must precede the
deletion of constraint C_i. In the opposite case of a delete-constraint-operation on
constraint C_i, we look for the last view that was not constraint-dependent
on C_i since we can then re-apply the remaining schema change sequence from
that view on under the assumption that C_i was never part of the MKB.



5.2 Re-applying a Meta Data Update Sequence 

After backtracking to a certain view V_i, a meta data change sequence S' derived
from the original sequence S has to be applied to V_i. In the first case of add-
constraint-operations, we cancel out the corresponding previous delete-operation
from S and apply this shorter sequence. In the second case of delete-constraint-
operations, the constraint C is removed from the MKB. The sequence S is then
applied to this changed MKB, without C being available for any view synchronization
step. In the case of a delete-CC or delete-RC, all dependent CPC or AC
are also deleted from the MKB and subsequent add-CPC- or add-AC-operations
are removed from S. A formal definition of this process is given in [KR00].
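
For the first case (an add-constraint operation), deriving S' from S amounts to dropping the matching earlier delete. The sketch below covers only that case and does not model from which node the shortened sequence is re-applied; the function name and the tuple encoding of changes are assumptions.

```python
def cancel_inverse_delete(sequence, add_change):
    """Derive S' from S for an add-constraint change.

    Changes are (op, kind, constraint) tuples. The most recent delete of the
    same constraint is dropped together with the add. If there is none, the
    add is simply appended, since there is nothing to cancel.
    """
    op, kind, constraint = add_change
    if op != "add":
        raise ValueError("this sketch only covers add-constraint operations")
    for i in range(len(sequence) - 1, -1, -1):
        if sequence[i] == ("delete", kind, constraint):
            return sequence[:i] + sequence[i + 1:]
    return sequence + [add_change]


# The sequence of Fig. 1: del-X, del-Y, add-U, followed by add-X.
S = [("delete", "CC", "X"), ("delete", "CC", "Y"), ("add", "CC", "U")]
print(cancel_inverse_delete(S, ("add", "CC", "X")))
# [('delete', 'CC', 'Y'), ('add', 'CC', 'U')]
```

The result contains exactly the two remaining operations, del-Y and add-U, that the example applies after the cancellation.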

5.3 Reconstructing the History Graph 

As the sequence S' is applied to MKB_i, new nodes N_{i+1}, ..., N_j (j < n) are
created and old nodes are discarded from the graph. Lastly, a new node N_{j+1}
is added to H(V), containing MKB_{j+1} and VKB_{j+1} and connected with N_j
through an edge E_{j+1} labeled with meta data change MDC_{j+1}.

In the technical report [KR00], we handle the case when backtracking reaches
the beginning of the meta-data change sequence. We also analyze details of the
different meta data change operations, such as co-dependencies between different
operations, and give examples for a number of cases. This is omitted here due
to the limited space.



6 Evaluation

In the technical report accompanying this paper [KR00], we give several theorems
on the correctness and advantages of the HD-VS algorithm. We prove that (1) 
after a valid meta data change that does not lead to backtracking, our algorithm 
will give a valid view, and (2) backtracking and re-applying a meta-data change 
sequence as discussed in Section 5 will always lead to valid view rewritings. We 
also show that under both add- and delete-schema-changes, our algorithm will 
give equal or better view rewritings than previous algorithms. 



7 Conclusion 

In this paper, we present a solution for view synchronization under meta data 
updates. With our algorithm, a view that is defined over a given set of relations 
and attributes can be synchronized with additions and deletions of both. Previous
approaches [RLN97, NLR98] to this problem could only adapt views to
deletions in the information space, but not to additions, so that, for example,
temporary unavailability of information sources could not be accounted for. This
technology can be applied to views defined over distributed information sources, 
especially as those sources are independent and prone to change their schemas 
or query capabilities.



Abstract. A data warehouse is a software infrastructure which supports OLAP
applications by providing a collection of tools for data extraction and cleaning,
data integration and aggregation, and data organization into multidimensional
structures. At the design level, a data warehouse is defined as a hierarchy of
view expressions whose ultimate nodes are queries on data sources. In this
paper, we propose a logical model for a data warehouse representation which
consists of a hierarchy of views, namely the base views, the intermediate views
and the user views. This schema can be used for different design purposes, such as
controlling the evolution of a data warehouse, which is also the focus of this paper.



1 Introduction 

Data warehousing is a software infrastructure which supports OLAP applications by 
providing a collection of tools to (i) collect data from a set of distributed 
heterogeneous sources, (ii) clean and integrate this data into a uniform representation, 
(iii) aggregate and organize this data into multidimensional structures and (iv) refresh 
it periodically to maintain its currency and accuracy. Many problems related to this
approach have been addressed, such as data extraction and cleaning, selection of the
best views to materialize, update propagation, and multidimensional representation
and manipulation of data.

Several projects dealing with data warehouses such as Whips [6], H2O [23], DWQ [8]
have been undertaken. Among the research problems related to data warehousing [20],
a substantial effort has been devoted to the selection of the set of views to
materialize [1] [17] [18] [14] [22], and the maintenance of the materialized views
[21]. Most of the addressed problems are related to the physical design of data
warehouses. Very little effort has been devoted to the conceptual and logical modeling of a
data warehouse. The only abstract representation given to the database which
implements a data warehouse is the so-called star schema or its constellation or
snowflake variants. To our knowledge, the properties of these schemas have not been formally
stated, and the semantics of these schemas are not even well defined.

Within the DWQ project [9], a substantial effort has been made toward the definition
of a conceptual schema [2]. A description logic approach is used to formally define a
DW conceptual schema as an extended entity-relationship diagram over which some
desired properties are defined. The approach provides a uniform representation and a
set of reasoning techniques which allow DW schemas to be correctly specified and
validated. It also provides a theoretical background to study the complexity of some
design problems.

The approach presented in this paper follows the same line although positioned at the
operational level instead of at the definition level. We concentrate on the study of a
set of views which model the DW as a set of relational queries. We give a
characterization of these views in terms of redundancy. This property can be used as
a design rule to enforce the quality of the logical schema. We then show how this
model is used to support data warehouse evolution. Some approaches based on a
repository [15] give a road map which allows DW changes to be traced. [19] studies the
evolution at the materialized views level. This latter approach is based on a specific
notion of redundancy which states that a materialized view is redundant if its removal
does not affect the rewriting of the user queries, and if its removal does not increase
the operational cost of the DW. If a materialized view is redundant, it can be removed
from the set of materialized views.

The paper is organized as follows: In section 2, we describe our motivations. Section
3 presents the design process of a data warehouse logical schema. Section 4 presents 
one possible use of the logical schema: the control of the data warehouse evolution, 
and a conclusion is given in section 5. 



2 Motivations

In database design, the conceptual schema is considered as an abstract representation 
which captures user needs in terms of data and constraints on this data. Different 
formalisms have been proposed, ranging from the entity-relationship model with its 
various extensions to more semantic models which include object behavior. The
logical schema is more operational, without necessarily being a physical
implementation. It allows some logical properties related to integrity
constraints and updates to be enforced. The physical schema is considered as the actual structure of
the database on which optimization and tuning can be effectively done. To enforce 
equivalence between the three schemas, mappings are defined between the conceptual 
and the logical schemas and between the logical and the physical schemas. 

A DW is nothing but a database which is more or less complex in its definition.
Consequently, applying the previous abstraction levels appears as a natural approach.
The main difference with a database used in a classical application is the feeding of
this database. The DW is automatically populated from external sources instead of
being inserted into and updated by user applications. Users are limited to querying the DW
and to applying analysis tools such as data mining tools. So, with respect to DW systems,
the structure mappings existing in a traditional database design process are
complemented by operational mappings which consider each abstraction level as a set
of views of the lower level, each view being defined as a query on the lower level
schema [2].

Users’ needs are represented by a set of views which may be redundant, or may share 
common sub-expressions. Our goal is to represent this set of views by a single 
representation at the logical level. We suppose that each view is represented by a 
relational query, and we propose a logical model for the views of a data warehouse 
called the view synthesis graph (VSG). The high level views are computed using a set 
of intermediate views or base views. The base views are directly computed from the 
relations of a single source. Each intermediate view represents either an expression on 
which a single relational operation is applied to produce a user view, or an expression 
shared by two or more intermediate views. 

The logical model can be used for many purposes [11] [10]: (i) as an intermediate
representation between the conceptual design and the physical design, providing an
operational view of the DW without necessarily dealing with performance or the
physical representation of data, (ii) as a reference schema from which the physical
design starts and against which the benefit of the selected materialized views is
weighed, (iii) as a support to control the DW evolution both at its client and source
levels.

The logical model provides independence between user queries and materialized
views. Consequently, it allows source and view evolution. We focus here on the
control of changes affecting the user needs, data sources or the design choices. For 
each case, we show how the logical model can be used to determine the effects of 
these changes. 



3. The design approach of the data warehouse logical model 

The design process of the DW logical schema starts from the representation of the 
different user views. We assume that these views are represented by a relational 
expression. The result of this process is the logical schema, called the view synthesis 
graph (VSG). One important feature of the VSG is that it shows a set of sub- 
expressions which are shared by several user views. The design steps of the logical 
schema are the following : 

• The generation of the multi-query graphs (MQGs): we suppose that each user
view Vi is a relational query with which a set of equivalent plans is associated.

• The generation of the multi-view graph (MVG) : the set of user view expressions 
are integrated into a single graph called the multi-view graph. This graph 
represents all the MQGs produced during the previous step. 

• The generation of the view synthesis graph (VSG) : this graph is generated from 
the MVG by eliminating redundant expressions. The resulting graph is the logical 
schema of the data warehouse.



3.1. The multi-query graphs



Let us consider a user view Vi. This view can be represented by a set of equivalent
algebraic expressions denoted $E_{V_i}$, such that $E_{V_i} = \{e^1_{V_i}, e^2_{V_i}, \ldots, e^k_{V_i}\}$.
The algebraic expressions corresponding to the user view Vi are denoted $e^j_{V_i}$. They use a
set of operations such as join, project and restrict operations, or some aggregate
function. Each expression represents a given plan for the user view Vi. In order to
limit the number of considered plans, we will consider only those for which the
project operation is the last one.



[Figure 1: (a) a user view V(A,G,H) computed from the source relations R1(A,B,C),
R2(C,G,E) and R3(E,H); (b) the corresponding multi-query graph.]

Fig. 1. The multi-query graph associated with a user view



Given a user view Vi and the associated set of equivalent expressions $E_{V_i}$, the
MQG represents each element $e^j_{V_i}$ of the set $E_{V_i}$. The MQG does not represent all the
expressions equivalent to the initial view, because the search for these expressions may
lead to an infinite set of expressions. For example, for the query given in figure 1a,
computed from the source relations R1, R2 and R3, the multi-query graph (figure 1b) is
composed of two expressions representing the two possible orderings of the join
operations.
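The two join orderings of this example can be enumerated mechanically. The following Python sketch is only an illustration under stated assumptions: plans are nested pairs, join is treated as commutative, Cartesian products are skipped, and the final projection on (A, G, H) is left implicit. The names `SCHEMAS`, `attrs`, `join_plans` and `show` are hypothetical and do not come from the paper.

```python
from itertools import combinations

# Schemas of the source relations of Fig. 1.
SCHEMAS = {"R1": {"A", "B", "C"}, "R2": {"C", "G", "E"}, "R3": {"E", "H"}}


def attrs(plan):
    """Attributes produced by a plan (a relation name or a pair of sub-plans)."""
    if isinstance(plan, str):
        return SCHEMAS[plan]
    left, right = plan
    return attrs(left) | attrs(right)


def join_plans(relations):
    """Enumerate join trees over `relations`, skipping Cartesian products.

    Join is treated as commutative, so (x, y) and (y, x) count as one plan.
    """
    relations = sorted(relations)
    if len(relations) == 1:
        return [relations[0]]
    first, rest = relations[0], relations[1:]
    plans = []
    # Keep `first` on the left side so each unordered split is generated once.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left_set = [first, *extra]
            right_set = [r for r in rest if r not in extra]
            for left in join_plans(left_set):
                for right in join_plans(right_set):
                    # A natural join needs at least one common attribute.
                    if attrs(left) & attrs(right):
                        plans.append((left, right))
    return plans


def show(plan):
    """Render a plan as a parenthesised join expression."""
    if isinstance(plan, str):
        return plan
    return f"({show(plan[0])} JOIN {show(plan[1])})"


plans = join_plans(["R1", "R2", "R3"])
for plan in plans:
    print(show(plan))
```

Under these assumptions the enumeration yields exactly two plans, one joining R1 with R2 first and one joining R2 with R3 first; the pairing of R1 with R3 is discarded because the two relations share no attribute.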



3.2. The multi-view graph 

Each user view being represented by an MQG, the multi-view graph (MVG)
is obtained by integrating all the MQGs into a single graph. More formally, let
$V = \{V_1, V_2, \ldots, V_n\}$ represent the set of user views, and
$E_{V_i} = \{e^1_{V_i}, e^2_{V_i}, \ldots, e^k_{V_i}\}$ the set of equivalent expressions
corresponding to a view Vi. The MVG is such that
$MVG = \{E_{V_1}, E_{V_2}, \ldots, E_{V_n}\}$.

Figure 2 shows an example of the graphical representation of an MVG, built by
integrating the MQGs corresponding to three user views V1(A, B, C, D, E, F), V2(A,
G, H) and V3(A, B, E, F). In this example, we can notice that some sub-expressions
are used by more than one user view. An ellipse surrounds these expressions.

[Figure 2: multi-view graph for the views V1, V2 and V3. Legend: join operation,
restriction operation, projection operation.]

Fig. 2. Example of a multi-view graph (MVG)



The MVG represents a set of equivalent expressions for all the user views of
the DW. If we consider the graphical representation of an MVG, each node in the
graph represents either a source relation, a sub-expression, or an expression 
corresponding to a user view. The graph shows all the existing sub-expressions used 
by each user view. Each sub-expression may be either shared by several plans, or 
used in only one plan. The sub-expressions used by more than one user view are 
surrounded by an ellipse. 



3.3. The view synthesis graph 

The view synthesis graph (VSG) is a canonical representation derived from the MVG. 
This representation is such that each user view is represented by only one plan. For 
each user view, among all the plans of the MVG, only one will be kept in the VSG. 
The plans of the VSG are chosen according to the following two criteria: (i) the
maximal number, denoted p, of user views which share the same sub-expression in
the MVG, and (ii) the number of shared sub-expressions, denoted n.

In the example given in figure 2, the shared sub-expressions are surrounded by an
ellipse. The first of these sub-expressions, the join operation between R1 and R2, is
shared by two different plans of the same user view V1; the second of these
sub-expressions, the join operation between R2 and R3, is shared by plans representing the
three views V1, V2 and V3. The maximal number of user views which share the same
sub-expression is then 3. Intuitively, building a VSG by maximizing p will result in a
graph which maximizes the number of user views which share a sub-expression,
while building a VSG by maximizing n will result in a graph which maximizes the
number of shared sub-expressions.

In the remainder of this section, we present some definitions which characterize a
VSG in terms of a level of overlapping.

3.3.1. Level of dependency of a given expression 

The level of dependency of an expression $e^j_{V_i}$ representing a user view Vi in a
multi-view graph MVG is the number of distinct user views $V_l$ such that there exists an
expression $e^k_{V_l}$ which shares a sub-expression e with $e^j_{V_i}$. This notion of dependency
allows the redundant plans of an MVG to be defined.

3.3.2. Redundant plans 

A plan $e_{V_i}$ is redundant in the MVG if there exists a distinct plan $e'_{V_i}$ representing the
same view Vi such that level_of_dependency($e_{V_i}$) < level_of_dependency($e'_{V_i}$). If $e'_{V_i}$ is
the plan having the highest level of dependency n($e'_{V_i}$), all plans having a lower level
of dependency are said to be redundant. Given this notion of redundancy, a view
synthesis graph (VSG) is defined as a graph representing only one expression for each
user view, and in which each plan $e_{V_i}$ representing the user view Vi is not redundant.
By definition and construction, the multi-view graph is unique. The view synthesis 
graph is not unique. For a given view Vi, there may exist more than one plan having 
the highest level of dependency, and consequently, the same MVG may lead to more 
than one VSG. 
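These definitions can be sketched in a few lines of Python. The MVG below is a hypothetical toy instance, not the one of figure 2: each plan is reduced to the set of sub-expressions it uses, and the level of dependency is assumed to count only views other than the one the plan belongs to. The plan labels and the function names are illustrative.

```python
# Hypothetical MVG: view -> {plan label: set of sub-expressions used by the plan}.
MVG = {
    "V1": {
        "p1": {"join(R1,R2)", "join(join(R1,R2),R3)"},
        "p2": {"join(R2,R3)", "join(R1,join(R2,R3))"},
    },
    "V2": {"p1": {"join(R2,R3)"}},
    "V3": {"p1": {"join(R2,R3)", "select(join(R2,R3))"}},
}


def level_of_dependency(mvg, view, plan):
    """Number of other views having some plan that shares a sub-expression."""
    subexprs = mvg[view][plan]
    return sum(
        1
        for other, plans in mvg.items()
        if other != view and any(subexprs & s for s in plans.values())
    )


def redundant_plans(mvg, view):
    """Plans of `view` whose level of dependency is below the highest one."""
    levels = {p: level_of_dependency(mvg, view, p) for p in mvg[view]}
    best = max(levels.values())
    return {p for p, level in levels.items() if level < best}


def one_vsg(mvg):
    """Pick one non-redundant plan per view (ties broken alphabetically)."""
    return {
        view: min(set(mvg[view]) - redundant_plans(mvg, view))
        for view in mvg
    }


vsg = one_vsg(MVG)
overlap = max(level_of_dependency(MVG, v, p) for v, p in vsg.items())
print(vsg, overlap)
```

In this toy instance the plan p1 of V1 shares nothing with the other views and is therefore redundant, while p2 shares the join of R2 and R3 with both V2 and V3. The alphabetical tie-break is an arbitrary choice: when several plans reach the highest level, each choice gives a different VSG.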

3.3.3. Level of overlapping of a view synthesis graph 

The level of overlapping of a VSG is the maximal level of dependency of the
expressions represented in this graph. If the VSG contains k user views Vi
described by their plans $e_{V_i}$, the level of overlapping of this
VSG, denoted n(VSG), is such that $n(VSG) = \max_{i=1,\ldots,k} n(e_{V_i})$. The VSG is a
hierarchy of views. Besides the user views, this hierarchy contains intermediate views 
and base views defined as follows : 

• Each sub-expression computed from the relations provided by a single data source 
is defined as a base view. 

• A given sub-expression in the VSG is an intermediate view if it satisfies the 
following conditions : 

1. this sub-expression is shared by more than one other expression in the VSG 

2. this sub-expression is neither a base view, nor a user view. 

Starting from the MVG given in figure 2, the VSG is built (i) by eliminating the plans
which do not share any sub-expression with other user views, and (ii) by eliminating
redundant plans, that is, plans whose level of dependency is lower than that of a
distinct plan representing the same user view. The resulting VSG is given in figure 3.
Each non-redundant plan corresponding to a user view V is represented by an
algebraic tree. The expressions RB1, RB2 and RB3 are the base views. Each of them is
computed from the data provided by a single data source. These expressions can be
viewed as the extractors associated with each source. The expression RB2 is an
intermediate view shared by the three user views V1, V2 and V3.

[Figure 3: view synthesis graph with the user views V1(A,B,C,D,E,F), V2(A,G,H) and
V3(A,B,E,F); the base view RB1(B,D) is computed from R1(B,D) of source S1, the base
view RB2(A,B,C,E,F) from the join of R2(A,B,C) and R3(E,F,A) of source S2, and the
base view RB3(A,G,E) from R4(G,H,E) of source S3.]

Fig. 3. Example of a logical model



4. Data warehouse evolution 

The logical model of a DW is a canonical representation of the user views.
One possible use of this representation is for controlling the DW evolution. We
consider hereafter three distinct cases of possible evolution: the evolution of the
user views, the evolution of the data sources, and the evolution of the materialized
views. In each of these situations, the view synthesis graph can be used in order to
evaluate the impact of such changes on the DW environment.



4.1. Evolution of the user needs

Two situations may occur with respect to user needs: a user view can be either added
to or removed from the data warehouse.

Adding a new view: If a new view Vi is added to the data warehouse, the VSG can
be used to check whether or not this view can be derived from the hierarchy of views
represented by the logical model. Three situations may occur:

• Vi can be derived directly from the VSG if it satisfies the following criteria: (i) the
expression representing Vi is either a base view or an intermediate view of the
VSG, (ii) the expression representing Vi is a sub-expression of the VSG.

• Vi can be derived indirectly from the VSG if it is derivable from one of its
transformations, that is, if it is derivable from the corresponding MVG. Notice that
the MVG is always derivable from the VSG.

• Vi is not derivable from the VSG. In this case Vi is added to the VSG; the
expression representing Vi which is added to the VSG is a non-redundant plan.
Each sub-expression of Vi shared by other views is a new intermediate view, and
each sub-expression computed from the data provided by a single source is a new
base view.

Deleting an existing view: If a view Vi is deleted from the data warehouse, the
corresponding expression is removed from the view synthesis graph. This operation
may change some intermediate views or base views, because some of the
initial intermediate views may no longer be shared, and some of the initial base views
may become useless if they were used only for computing the view Vi.
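The effect of deleting a view can be sketched as follows. This is a simplified illustration, not the authors' procedure: the VSG is a hypothetical one in which each user view is reduced to the set of sub-expressions of its single plan, and a sub-expression is assumed to be an intermediate view when it is used by at least two user views and is not a base view.

```python
from collections import Counter

# Hypothetical VSG: user view -> sub-expressions used by its single plan.
BASE_VIEWS = {"RB1", "RB2", "RB3"}
VSG = {
    "V1": {"RB1", "RB2", "join(RB1,RB2)"},
    "V2": {"RB2", "RB3", "join(RB2,RB3)"},
    "V3": {"RB1", "RB2", "join(RB1,RB2)", "select(join(RB1,RB2))"},
}


def classify(vsg):
    """Return (base views in use, intermediate views) for a VSG."""
    usage = Counter(expr for exprs in vsg.values() for expr in exprs)
    used_base = {expr for expr in usage if expr in BASE_VIEWS}
    intermediate = {
        expr for expr, count in usage.items()
        if count > 1 and expr not in BASE_VIEWS
    }
    return used_base, intermediate


def delete_view(vsg, view):
    """Remove `view`; report base views that become useless and
    intermediate views that are no longer shared."""
    base_before, inter_before = classify(vsg)
    remaining = {v: exprs for v, exprs in vsg.items() if v != view}
    base_after, inter_after = classify(remaining)
    return remaining, base_before - base_after, inter_before - inter_after


_, useless_base, unshared = delete_view(VSG, "V2")
print(useless_base, unshared)
```

Deleting V2 in this toy graph makes the base view RB3 useless, whereas deleting V3 leaves all base views in use but the join of RB1 and RB2 is no longer shared.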



4.2. Evolution of data sources 

Another change that may occur in the data warehouse environment is adding or 
deleting a data source. A new source can be added to the environment, and an existing 
source may be disabled. Each of these situations has some effects on the DW 
environment; these effects can be determined using the view synthesis graph. 

Adding a new data source: If a new data source is added to the data warehouse
environment, two situations may occur:

• The new source Sj has the same object types as an existing source Si. In this case,
the VSG may be changed. Different modifications can be made depending on the
way users want to integrate this source in their existing views. Union, intersection
and difference are candidate operations to integrate the new source. This leads to
the change of some intermediate and base views. For example, consider the logical
model given in figure 3 and suppose a new source S which contains a source
relation R(B, D) is added. R and R1 have the same schema and the semantics of the
application requires these two relations to be merged. The changes occurring in the
VSG are shown in figure 4.

The view V1, which was computed from the relation R1 of source S1, is now computed
from the union of the two source relations R1 and R. The union operator is not a
shared operator, and the set of intermediate views remains identical. But a new base
view RB is introduced in the VSG to represent the extraction of the new source
relation R.

• The new source Sj is different from any of the existing sources. In this case, the
view synthesis graph is not changed since none of the object types in Sj is used by the
existing views. The new source Sj can be used only by newly added views
(updates of existing views are considered as definitions of new views).

Deleting a data source: If an existing data source is deleted from the DW environment,
it may become impossible to compute some views. These views are identified using the
VSG. Consider for example that the source S3 is deleted in the DW described by the logical
model given in figure 4. It becomes impossible to compute the view V2, which uses
the source relation R3. If the source S is deleted, the view V1 is still computable using
the source S1. Deleting a source can modify the sets of base views or intermediate
views. For example, if the source S is deleted, RB is deleted from the set of base
views. In general, the effects of deleting a data source Si in the DW environment are
determined by identifying each view V of the VSG that uses a relation Rj contained in
this source. If there is a distinct source relation Rk such that Rk and Rj have the same
schema, then the view V is still computable; if not, the view V is no longer computable.
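The general rule can be written as a small impact analysis. The sketch below is illustrative only: the catalogue of sources and the assignment of relations to views are hypothetical, loosely modelled on the sources of figure 4, and two relations are assumed to have "the same schema" when their attribute sets are equal.

```python
# Hypothetical catalogue: source -> {relation name: set of attributes}.
SOURCES = {
    "S1": {"R1": frozenset("BD")},
    "S": {"R": frozenset("BD")},
    "S2": {"R2": frozenset("ABC"), "R3": frozenset("EFA")},
    "S3": {"R4": frozenset("GHE")},
}

# Hypothetical assignment of source relations to the user views of the VSG.
VIEW_USES = {
    "V1": {"R1", "R", "R2", "R3"},
    "V2": {"R2", "R3", "R4"},
    "V3": {"R2", "R3"},
}


def views_lost_by_deleting(source, sources=SOURCES, view_uses=VIEW_USES):
    """Map each view that becomes non-computable to the relations it loses.

    A view survives the loss of a relation Rj if some relation Rk of a
    remaining source has the same schema as Rj.
    """
    deleted = sources[source]
    surviving_schemas = {
        schema
        for name, relations in sources.items()
        if name != source
        for schema in relations.values()
    }
    lost = {}
    for view, used in view_uses.items():
        unreplaceable = sorted(
            rel for rel in used
            if rel in deleted and deleted[rel] not in surviving_schemas
        )
        if unreplaceable:
            lost[view] = unreplaceable
    return lost


print(views_lost_by_deleting("S"))
print(views_lost_by_deleting("S3"))
```

With this catalogue, deleting S affects no view because R1 has the same schema as R, while deleting S3 makes V2 non-computable.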



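[Figure 4: the view synthesis graph of figure 3 after adding the source S. The user
views are V1(A,B,C,D), V2(A,G,H) and V3(A,B,E,G). The new base view RB(B,D) is
computed from the relation R(B,D) of the new source, alongside R1(B,D) of source S1;
the other source relations are R2(A,B,C) and R3(E,F,A) of source S2 and R4(G,H,E) of
source S3, from which the base view RB3(G,E) is computed.]

Fig. 4. Adding a source to the data warehouse environment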



4.3. Evolution of the materialized views 

Another change which can affect the DW is the evolution of the physical schema, i.e. 
the materialized views. Materialized views are selected on the basis of a cost function 
which models a compromise between the cost of maintaining the materialized views 
and the cost of evaluating the virtual views. This cost takes into account a set of 
parameters such as access frequency, storage space and update propagation cost. 
When one of these parameters changes, the current materialized views may no longer be
satisfactory with respect to performance objectives. A new physical design is then
started and a new set of materialized views is selected. Changing the materialization
may result in changing all the expressions defined over the first set of materialized
views, unless these expressions were defined on the view synthesis graph, which is
then used as a logical schema. The existence of the VSG provides logical
independence between user applications and the data warehouse physical schema.
Changes to the physical schema lead to changes in the expressions which define the
mappings between the logical views (the VSG) and the materialized views.



5. Conclusion

In this paper we have presented a logical reference schema and its use for controlling
the evolution of the DW environment. This reference schema is a canonical
representation of data warehouse views. This logical model is built from subsets of
the possible plans for each user view, which are integrated in an intermediate
representation called the multi-view graph. The VSG is built by identifying for each
view one plan which is not redundant in the multi-view graph. The VSG is useful for
different purposes, among them the control of the evolution of the data warehouse.
We have presented three different kinds of changes that may occur in a data
warehouse environment, namely changes affecting the user needs, changes affecting
the data sources and changes affecting the physical design. In each situation, we have
shown how the view synthesis graph can be used to determine the effects of each of
these evolutions.



Abstract. Schema design is one of the fundamentals of database theory and practice.
In this paper, we discuss the problem of locally valid dimensional
attributes in a classification hierarchy of a typical OLAP scenario. In a first step,
we show that the traditional star and snowflake schema approach is not feasible
in this very natural case of a hierarchy. Therefore, we sketch two alternative modeling
approaches resulting in practical solutions and a seamless extension of the
traditional star and snowflake schema approach: In a pure relational approach, we
replace each dimension table of a star/snowflake schema by a set of views
directly reflecting the classification hierarchy. The second approach takes advantage
of object-relational extensions. Using object-relational techniques in the
context of the relational representation of a multidimensional OLAP scenario is
a novel approach and promises a clean and smooth schema design.



1 Introduction 

In the last few years, “Online Analytical Processing” (OLAP, [CoCS93]) has become 
a major research area in the database community (special data models: [VaSe99]; SQL 
extensions: [GBLP96], [SQL99]). One consequence of the OLAP fever is the rejuvena- 
tion of the multidimensional data model. The ROLAP approach (“Relational OLAP”) 
simulates the multidimensionality and performs data access on top of a relational data- 
base engine, thus using sophisticated relational base technology to handle, i.e. store and 
analyze the typical large data volumes of the underlying data warehouses. This 
approach however needs an adequate relational representation, which is typically a vari- 
ation of a star / snowflake schema. 

Based on the experiences from an industrial project, we have seen that the traditional 
modeling techniques for the relational based solution, star and snowflake schema, are 
not always adequate ([ALTK97]). Even considering multiple variations with regard to 
slowly changing dimensions, factless fact tables, etc. ([Inmo96]), we demonstrate a 
modeling problem which is not addressed adequately in literature. This paper reviews 
our schema design problems with the traditional techniques and proposes two more 
general and therefore alternative modeling approaches. 

The key idea of the multidimensional model is that each dimension of a multidimensional
data cube, e.g. products, shops, or time, can be seen as part of the primary key,
defining the cartesian product with the elements of the dimensions. Consequently, any
combination of the composite primary key identifies exactly a single cell within the cube.
Each cell may hold a numerical fact value (measure) or a NULL value if no such
entry exists. As illustrated in figure 1, based on the dimensional elements, e.g.
single articles in the product dimension, classifications can be defined to identify
different classes C like product families, groups, or product areas. Each classification
node at a specific classification level can be seen as an instance of a corresponding
classification attribute ($CA_i$).

Additionally, dimensional attributes ($DA_k$) like brand, color, shop type etc. can be
used to enrich the multidimensional analysis process. As depicted in figure 2, these
attributes, characterizing single dimensional elements, are orthogonal
to the classification hierarchy, which classifies dimensional elements.

Thus, a typical question according to a multidimensional scenario could be as follows: 
Give me the total sales of consumer electronics goods for Europe and the first quarter 
of 1997 by different brands and different shop types. 

It is worth mentioning here that such a structure with special (better: locally valid)
dimensional attributes for different classes within the classification hierarchy reflects
the basic idea of classification. What else should be the reason for classification?
Classifications are generally used to hide specialities of subclasses and perform an
abstraction process when going from subclasses to super-classes. We think that this idea
should be adequately reflected in the relational schema design of a multidimensional scenario.

2 The Traditional Relational OLAP Approach 

To illustrate the failure of the traditional ROLAP approach and to motivate our alternative
approach, we refer to a sample scenario, stemming from a joint research project
with a worldwide operating retail research company. In their business, facts like sales,
stock, or turnover values of single articles in single shops at a specific period of time are
monitored and collected to form the raw database. For example:

Facts (ArticleID, ShopID, Period, SALES, STOCK, TURNOVER)
       TR-75      203     05/97   121    78     333
       TS-78      203     05/97   112    63     121




[Figure 1: sample classification hierarchy of the product dimension. Legend: PA:
primary attribute; boxed labels such as TR-75: dimensional elements; $CA_i$: classification
attribute at classification level i; O: classification node C (class).]

Fig. 1. Sample classification hierarchy

[Figure 2: sample dimensional attributes attached to dimensional elements such as
TR-75. Legend: $DA_k$: dimensional attribute.]

Fig. 2. Sample dimensional attributes



Generally, this raw database is called a fact table and consists of two main
components: a set of dimensions, which we denote as ‘primary attributes’ $PA_i$ ($1 \le i \le n$), forming
the composite primary key of the table, and a set of measures $\{f_1, \ldots, f_k\}$ denoting the
figures being analyzed. The number of primary attributes n determines the dimensionality
of the problem:

Facts($PA_1, \ldots, PA_n, f_1, \ldots, f_k$)

In a further step of the retail research evaluation phase, the raw database is analyzed in
two ways: On the one hand, the data is aggregated along a predefined classification
hierarchy (like the one in figure 1). On the other hand, the data is split according to
characteristic features of the single articles or shops. For example, each shop holds
country-specific information about its purchase class or its shop type (“cash&carry”,
“retail”, “hypermarket”). In the product dimension, each article of the 250,000 monitored
products belongs to one of 400 product families. Furthermore, each article is characterized by
five attributes valid for all products (‘brand’, ‘package type’, ...) and about 15 attributes
which are valid only in the product family or product group to which the article belongs
(‘video system’ only for the product group Video equipment, ‘water usage’ only for
the product family Washers).



2.1 Star Schema 



The simplest traditional way to model this qualifying information skeleton used during
the analysis process is to use a single dimension table $D^i$ ($1 \le i \le n$) for each dimension to
resolve high-level terms according to the classification hierarchy and to represent
dimensional attributes. Since each dimension is connected to the corresponding primary
key of the fact table, the whole scenario looks like a ‘star’. Figure 3 illustrates the star
schema for the ongoing example. Formally, the schema of the dimension table for the
dimension i consists of the primary attribute $PA_i$, all classification attributes $CA_j$ ($1 \le j \le p$)
and the complete set of dimensional attributes $DA_k$ ($1 \le k \le m$):

$D^i(PA_i, DA_1, \ldots, DA_m, CA_1, \ldots, CA_p)$

It is worth noting that (mainly for performance reasons) the classification hierarchy is
modeled as a set of functionally dependent attributes ($CA_j \to CA_{j+1}$, $1 \le j < p$).
Furthermore, the distinction between dimensional elements organized in hierarchies and
dimensional attributes further characterizing the elements, which is explicitly prescribed
in the multidimensional model, gets totally lost. Figure 4 shows the product dimension
table for the market research example.

[Figure 3: star schema for the ongoing example.]

Fig. 3. Sample Star Schema

          PA          DA1      DA2                                DAm    CA1        CA2       CA3
Articles (ArticleID,  Brand,   VSys,  RC,   ...  Load,  Water,  Temp,  Family,    Group,    Area)
          TR-75       Sony     Hi8    NULL  ...  NULL   NULL    NULL   Camcorder  Video     ConsElectr
          TS-78       Sony     Hi8    NULL  ...  NULL   NULL    NULL   Camcorder  Video     ConsElectr
          A200        JVC      NS     NULL  ...  NULL   NULL    NULL   Camcorder  Video     ConsElectr
          V-201       JVC      VHS    Yes   ...  NULL   NULL    NULL   HomeVCR    Video     ConsElectr
          ClassicI    Grundig  VHS-C  No    ...  NULL   NULL    NULL   HomeVCR    Video     ConsElectr
          AB1043      Ariston  NULL   NULL  ...  5 kg   45 l    NULL   Washer     HomeAppl  WhiteGoods
          Princess    Miele    NULL   NULL  ...  5 kg   41 l    NULL   Washer     HomeAppl  WhiteGoods
          SuperII     Hoover   NULL   NULL  ...  4 kg   54 l    NULL   Washer     HomeAppl  WhiteGoods
          Duett       Miele    NULL   NULL  ...  6 kg   NULL    37°C   Dryer      HomeAppl  WhiteGoods
          Lavamat     AEG      NULL   NULL  ...  6 kg   NULL    39°C   Dryer      HomeAppl  WhiteGoods

Fig. 4. Sample star schema dimension table for the product dimension



2.2 Snowflake Schema



Elimination of the functional dependencies in the single dimension tables, i.e. normalizing
the star schema, leads to small satellite tables representing the dimensional hierarchy.
Therefore, the set of dimension tables for a single dimension is modeled in the
following manner:

$D^i(PA_i, DA_1, \ldots, DA_{n_1}, CA_1)$
$D^i_1(CA_1, DA_{n_1+1}, \ldots, DA_{n_2}, CA_2)$
$\ldots$
$D^i_{p-1}(CA_{p-1}, DA_{n_{p-1}+1}, \ldots, DA_m, CA_p)$

Again, $PA_i$ ($1 \le i \le n$) denotes the primary attribute of the i-th dimension used for joining
the fact table. The classification attribute $CA_j$ forming the hierarchy acts as a foreign key in
classification level j-1 and as a primary key in level j. Furthermore, all dimensional
attributes $DA_k$ at level j are fully dependent on $CA_j$. To adapt the dimension table of the
ongoing example, the classification attributes ‘Group’ and ‘Area’ are shifted into two
new relations. The relational schema for the product dimension of the current example
is depicted in figure 5:

Articles (ArticleID,  Brand,   VSys,  BLT,  RC,   ...  Load,  Water,  Temp,  Family)
          TR-75       Sony     Hi8    2h    NULL  ...  NULL   NULL    NULL   Camcorder
          TS-78       Sony     Hi8    2h    NULL  ...  NULL   NULL    NULL   Camcorder
          A200        JVC      NS     3h    NULL  ...  NULL   NULL    NULL   Camcorder
          V-201       JVC      VHS    NULL  Yes   ...  NULL   NULL    NULL   HomeVCR
          ClassicI    Grundig  VHS-C  NULL  No    ...  NULL   NULL    NULL   HomeVCR
          AB1043      Ariston  NULL   NULL  NULL  ...  5 kg   45 l    NULL   Washer
          Princess    Miele    NULL   NULL  NULL  ...  5 kg   41 l    NULL   Washer
          SuperII     Hoover   NULL   NULL  NULL  ...  4 kg   54 l    NULL   Washer
          Duett       Miele    NULL   NULL  NULL  ...  6 kg   NULL    37°C   Dryer
          Lavamat     AEG      NULL   NULL  NULL  ...  6 kg   NULL    39°C   Dryer

Families (Family,     Group)
          Camcorder   Video
          HomeVCR     Video
          Washer      HomeAppl
          Dryer       HomeAppl

Groups (Group,     Area)
        Video      ConsElectr
        HomeAppl   WhiteGoods

Fig. 5. Sample snowflake schema dimension tables for the product dimension
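The lookup joins implied by the snowflake schema can be tried out with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: only a few columns and rows are loaded, and the table `ProductGroups` and the column `GroupName` are renamings introduced here to avoid clashing with SQL keywords; they are not names used in the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ProductGroups (GroupName TEXT PRIMARY KEY, Area TEXT);
CREATE TABLE Families (Family TEXT PRIMARY KEY,
                       GroupName TEXT REFERENCES ProductGroups);
CREATE TABLE Articles (ArticleID TEXT PRIMARY KEY, Brand TEXT,
                       Family TEXT REFERENCES Families);

INSERT INTO ProductGroups VALUES ('Video', 'ConsElectr');
INSERT INTO ProductGroups VALUES ('HomeAppl', 'WhiteGoods');
INSERT INTO Families VALUES ('Camcorder', 'Video');
INSERT INTO Families VALUES ('HomeVCR', 'Video');
INSERT INTO Families VALUES ('Washer', 'HomeAppl');
INSERT INTO Families VALUES ('Dryer', 'HomeAppl');
INSERT INTO Articles VALUES ('TR-75', 'Sony', 'Camcorder');
INSERT INTO Articles VALUES ('V-201', 'JVC', 'HomeVCR');
INSERT INTO Articles VALUES ('AB1043', 'Ariston', 'Washer');
INSERT INTO Articles VALUES ('Duett', 'Miele', 'Dryer');
""")

# Resolving the product area of an article needs two lookup joins,
# one per satellite table of the snowflake schema.
rows = conn.execute("""
    SELECT g.Area, COUNT(*)
    FROM Articles a
    JOIN Families f ON a.Family = f.Family
    JOIN ProductGroups g ON f.GroupName = g.GroupName
    GROUP BY g.Area
    ORDER BY g.Area
""").fetchall()
print(rows)
```

In a star schema the same query would read the area directly from the single dimension table, which is the performance argument for denormalization.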



2.3 Summary and Conclusion 

The star/snowflake schema approach allows modeling a wide range of simple multidimensional
scenarios. From a performance point of view the star schema avoids a lot of
lookup joins with the satellite tables, but it is questionable from a schema design point
of view.

Both traditional approaches however fail from an implementation and schema design
point of view, if the existence of dimensional attributes depends on values of the
dimensional elements in the classification hierarchy. As we have seen in the market research
scenario, this problem arises in practice: e.g. the dimensional attributes ‘Water’ and
‘Load’ are only applicable to washers and not to video equipment.



3 The Alternative Relational OLAP Approach

With our solution, we adopt the idea of [SmSm77] of introducing a special class of
attributes, called discriminating attributes. This kind of attribute holds relation names
as its attribute values, which allows a real hierarchical representation of the problem
of mapping a pyramid of concepts [Lore87], i.e. the classification hierarchy, to the
relational data model.



In a first step, we can model each product family, i.e. each leaf node of the classification tree,
in a separate relation with all the node-specific dimensional attributes. For example, the
four sample product families from figure 1 are modeled in the following way:



Camcorder (ArticleID,  Brand,   VSys,  BLT)
           TR-75       Sony     Hi8    2h
           TS-78       Sony     Hi8    2h
           A200        JVC      NS     3h

HomeVCR (ArticleID,  Brand,    VSys,   RC)
         V-201       JVC       VHS     Yes
         ClassicI    Grundig   VHS-C   No

Washer (ArticleID,  Brand,    Load,  Water)
        AB1043      Ariston   5 kg   45 l
        Princess    Miele     5 kg   41 l
        SuperII     Hoover    4 kg   54 l

Dryer (ArticleID,  Brand,  Load,  Temp)
       Duett       Miele   6 kg   37°C
       Lavamat     AEG     6 kg   39°C



More formally, the schema of a classification node C at the lowest, i.e. first, classification
level with the classification attribute $CA_1$ is denoted by:

$C(PA, DA_1, \ldots, DA_m)$

As usual, PA is the primary attribute for the join with the fact table and the set of $DA_k$
($1 \le k \le m$) denotes the dimensional attributes which are applicable to the classification
node C.



The construction of the classification hierarchy is made in a bottom-up fashion, i.e. sets 
of classification nodes are grouped into a new high-level term, i.e. a new classihcation 
node. Suppose, in the j-th step, the classihcation nodes {C , ..., C'' } with the set of 
locally valid dimensional attributes {DA.,' , ..., DA^ ' } for each C' corresponding to the 
classihcation attribute CAj_i are subsumed by the higher level node C corresponding to 
the classihcation attribute CAj. The set of valid dimensional attributes is achieved by 
intersecting all attribute sets of the subsumed nodes. 

{DA^,..., DA^}:= nJoAi^..,DA^^j 

For example, Camcorder and HomeVCR are classified into the class Video. Washer
and Dryer are subsumed by the new higher level classification node Home Appliances.
Furthermore, only those dimensional attributes which are still valid there are propagated
to the new parent node. Hence, the specific dimensional attributes 'BLT' (for Camcorder)
and 'RC' (for HomeVCR) are lost, whereas the attributes 'VSys' and 'Brand' are
propagated to the Video class.
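
The intersection rule can be sketched in a few lines of Python. This is only an illustration, not part of the proposed relational design: the dictionary `LEAF_ATTRIBUTES` and the helper `parent_attributes` are hypothetical names, and the attribute sets are copied from the four sample relations above.

```python
from functools import reduce

# Dimensional attributes of the four leaf nodes, taken from the sample relations.
LEAF_ATTRIBUTES = {
    "Camcorder": {"Brand", "VSys", "BLT"},
    "HomeVCR": {"Brand", "VSys", "RC"},
    "Washer": {"Brand", "Load", "Water"},
    "Dryer": {"Brand", "Load", "Temp"},
}


def parent_attributes(children, attributes):
    """Return the attributes valid for a parent node: the intersection
    of the attribute sets of all subsumed child nodes."""
    return reduce(set.intersection, (attributes[c] for c in children))


video = parent_attributes(["Camcorder", "HomeVCR"], LEAF_ATTRIBUTES)
home_appl = parent_attributes(["Washer", "Dryer"], LEAF_ATTRIBUTES)
print(sorted(video))      # ['Brand', 'VSys']
print(sorted(home_appl))  # ['Brand', 'Load']
```

Applied to all four leaves at once, the same function leaves only 'Brand', which matches the attributes that survive up to the root of the hierarchy.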



Generally, the schema of a new classification node $C$ for the classification attribute $CA_j$
($1 \le j \le p$) is algorithmically determined by:

$C(PA, DA_1, \ldots, DA_m, CA_1, \ldots, CA_{j-1}, CA_j) := \bigcup_{i=1}^{l} \pi_{PA, DA_1, \ldots, DA_m, CA_1, \ldots, CA_{j-1}, \text{'}C^i\text{'}}(C^i)$

The key point of this technique is that the name of each classification node $C^i$ is added as a
constant value for the new classification attribute $CA_j$ of the new classification node $C$.



In building the higher level classes, we make intensive use of the view mechanism of the
relational database system on the implementation side. Below, the view definitions of the
product groups Video and Home Appliances are illustrated. In analogy to the formal
description, each relation name appears as a constant value in the new view:



create view Video (ArticleID, Brand, VSys, Family) as
    select ArticleID, Brand, VSys, 'Camcorder'
    from Camcorder
    union
    select ArticleID, Brand, VSys, 'HomeVCR'
    from HomeVCR

Video    (ArticleID, Brand,    VSys,  Family)
          TR-75      Sony      HI8    Camcorder
          TS-78      Sony      HI8    Camcorder
          A200       JVC       N8     Camcorder
          V-201      JVC       VHS    HomeVCR
          ClassicI   Grundig   VHS-C  HomeVCR

create view HomeAppl (ArticleID, Brand, Load, Family) as
    select ArticleID, Brand, Load, 'Washer'
    from Washer
    union
    select ArticleID, Brand, Load, 'Dryer'
    from Dryer

HomeAppl (ArticleID, Brand,    Load,  Family)
          AB1043     Ariston   5 kg   Washer
          Princess   Miele     5 kg   Washer
          SuperII    Hoover    4 kg   Washer
          Duett      Miele     6 kg   Dryer
          Lavamat    AEG       6 kg   Dryer

Each view holds the primary attribute, the applicable dimensional attributes and (in
analogy to the star schema) all classification attributes. Furthermore, each view forms
the basis for defining higher level classification nodes. This recursive definition is
shown below for the classification hierarchy depicted in figure 1.



create view ConsElectr (ArticleID, Brand, Family, Group) as
    select ArticleID, Brand, Family, 'Video'
    from Video
    union ....

create view WhiteGoods (ArticleID, Brand, Family, Group) as
    select ArticleID, Brand, Family, 'HomeAppl'
    from HomeAppl
    union ....

create view Articles (ArticleID, Brand, Family, Group, Area) as
    select ArticleID, Brand, Family, Group, 'ConsElectr'
    from ConsElectr
    union
    select ArticleID, Brand, Family, Group, 'WhiteGoods'
    from WhiteGoods



In comparison to the traditional modeling approach (figure 4), only those attributes are
available in the dimension table which are applicable to all dimensional elements. To
address specific features, the corresponding dimensional sub-tables have to be used, whose
names are specified as instances of the classification attributes. Figure 6 summarizes the
modeling of the classification hierarchy using the procedure illustrated in this section. In
analogy to figure 5 (snowflake schema), our approach can be straightforwardly normalized, too.

Articles (ArticleID, Brand,    Family,     Group,     Area)
          TR-75      Sony      Camcorder   Video      ConsElectr
          TS-78      Sony      Camcorder   Video      ConsElectr
          A200       JVC       Camcorder   Video      ConsElectr
          V-201      JVC       HomeVCR     Video      ConsElectr
          ClassicI   Grundig   HomeVCR     Video      ConsElectr
          AB1043     Ariston   Washer      HomeAppl   WhiteGoods
          Princess   Miele     Washer      HomeAppl   WhiteGoods
          SuperII    Hoover    Washer      HomeAppl   WhiteGoods
          Duett      Miele     Dryer       HomeAppl   WhiteGoods
          Lavamat    AEG       Dryer       HomeAppl   WhiteGoods

Fig. 6. Sample dimension table for the product dimension (alternative approach)



In a nutshell, in the case of the traditional approach, the (senseless) query asking for
total sales of Home Appliances by 'video system' would result in a table scan of the
dimension table, yielding an empty join partner for the fact table and finally a
numerical zero. In our alternative approach, the query would be rejected, since Home
Appliances does not contain a dimensional attribute 'video system'.
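
The rejection can be reproduced with a small sketch. As before, SQLite via Python's `sqlite3` module is only an assumed stand-in for the relational system, and the tables are left empty because the error is raised when the statement is compiled, before any data is read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Washer (ArticleID text, Brand text, Load text, Water text);
    create table Dryer  (ArticleID text, Brand text, Load text, "Temp" text);
    create view HomeAppl (ArticleID, Brand, Load, Family) as
        select ArticleID, Brand, Load, 'Washer' from Washer
        union
        select ArticleID, Brand, Load, 'Dryer' from Dryer;
""")

# Grouping Home Appliances by video system refers to a column that the
# view does not expose, so the statement fails instead of returning zero.
try:
    conn.execute("select VSys, count(*) from HomeAppl group by VSys")
    rejected = False
except sqlite3.OperationalError as err:
    rejected = True
    print("query rejected:", err)
```

The senseless query is thus caught at the schema level instead of silently producing an empty result.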








3 Object-Relational Design 

Another alternative and novel approach to overcome the limitations of the traditional
star/snowflake schema pattern is to use object-relational techniques. Since object-relational
concepts are supported only to a certain degree and implemented in a very system-specific
manner, we propose an object-relational schema design based on the capabilities
of the IBM DB2/UDB V6.1 database system.

The design of an object-relational schema in DB2 is divided into two phases. In a first
step, we have to define the type hierarchy and references based on types. Referring to
these types, we are then able to 'instantiate' regular tables (also called typed tables).
Moreover, in contrast to the approach shown in the previous section, we have to proceed
top-down when defining the object-relational schema of the dimensional structures.



3.1 Type Definitions 

The super-type of a dimension holds only the most generic dimensional attributes and
all possible classification attributes. For the ongoing example of the product dimension,
the following DDL statement introduces the generic type Articles_T, where the single
articles are identified by the ArticleID attribute.

create type Articles_T as (brand  varchar(30),
                           area   varchar(30),
                           group  varchar(30),
                           family varchar(30));

Analogous to the classical concept of inheritance, special classes of products are 
derived from the more general classes of products and specific dimensional attributes 
are added to the derived type. Below are the SQL statements to define the necessary sub- 
types of the product classification hierarchy. For each sub-type, the keyword UNDER 
denotes the corresponding super-type. 



create type ConsElectr_T under Articles_T   as (...);
create type WhiteGoods_T under Articles_T   as (...);
create type Video_T      under ConsElectr_T as (vidsys varchar(30));
create type HomeAppl_T   under WhiteGoods_T as (load varchar(30));
create type HomeVCR_T    under Video_T      as (RC char(1));
create type Camcorder_T  under Video_T      as (BLT varchar(5));
create type Washer_T     under HomeAppl_T   as (Water varchar(5));
create type Dryer_T      under HomeAppl_T   as (Temp short);



Once we have defined the types of the dimensional structures, e.g. for the Products 
dimension and the Shops dimension, we are able to create a type for the fact table. 
Although this is not a mandatory step to design the schema of the fact table using OR 
technology, we are able to achieve some advantage when querying the database later. 
Therefore, the type of a fact table consists of two references to the super-types of the 
participating dimensions. 

create type Facts_T as (
    ArticleID REF(Articles_T),
    ShopID    REF(Shops_T),
    Period    date,
    Sales     integer,
    Stock     integer,
    Price     integer);

It is worth mentioning here that this construction yields the following advantage:
Consider the situation where we have to set up a data mart for a specific product group, e.g.
video equipment. We are now able to create a type of a customized fact table allowing
only references to the articles belonging to the video class. To do so, we substitute
the reference to the generic articles type by a reference to the more specific type
Video_T (... ArticleID REF(Video_T), ...). This design ensures already at the schema
level (!) that the fact table for this specific data mart will never contain a product other
than a video article.
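
The effect of the typed reference can be imitated in Python as a loose analogy. The sketch below is not DB2 syntax and not the paper's implementation: the class names mirror the types above, `VideoFact` is a hypothetical fact row, and the check runs when a fact is created, whereas DB2 enforces the restriction through the declared reference type.

```python
from dataclasses import dataclass


@dataclass
class ArticlesT:            # generic super-type: attributes valid for all articles
    article_id: str
    brand: str
    family: str
    group: str
    area: str


@dataclass
class ConsElectrT(ArticlesT):
    pass


@dataclass
class VideoT(ConsElectrT):  # adds the attribute shared by all video articles
    vidsys: str


@dataclass
class CamcorderT(VideoT):   # adds the camcorder-specific attribute
    blt: str


@dataclass
class WhiteGoodsT(ArticlesT):
    pass


@dataclass
class VideoFact:
    """Hypothetical fact row for a video data mart; the article reference
    is restricted to VideoT, mimicking REF(Video_T)."""
    article: VideoT
    sales: int

    def __post_init__(self):
        if not isinstance(self.article, VideoT):
            raise TypeError("article must reference a video article")


camcorder = CamcorderT("TR-75", "Sony", "Camcorder", "Video", "ConsElectr", "HI8", "2h")
fact = VideoFact(camcorder, sales=45)

washer = WhiteGoodsT("AB1043", "Ariston", "Washer", "HomeAppl", "WhiteGoods")
try:
    VideoFact(washer, sales=1)
    accepted = True
except TypeError:
    accepted = False
print(accepted)  # False
```

A camcorder is accepted because Camcorder_T lies below Video_T in the hierarchy, while a white-goods article is refused.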

3.2 Typed Table Definitions 

Once we have defined the type hierarchy for the classification schema, we are in
a position to 'instantiate' these types, resulting in so-called 'typed tables'. Analogous
to the type definition, we proceed top-down. For the instantiation of the super-type, we need
to introduce an object identifier attribute. For the ongoing example, we use the ArticleID
as reference attribute (or primary key attribute in relational terminology), whose content
is user generated (as opposed to system generated). All dependent sub-tables
are instantiated according to their type and with reference to their direct super-table.

create table Articles of Articles_T (REF IS ArticleID user generated);
create table ConsElectr of ConsElectr_T under Articles inherit select privileges;
create table WhiteGoods of WhiteGoods_T under Articles inherit select privileges;
create table Video of Video_T under ConsElectr inherit select privileges;
create table HomeAppl of HomeAppl_T under WhiteGoods inherit select privileges;
create table HomeVCR of HomeVCR_T under Video inherit select privileges;
create table Camcorder of Camcorder_T under Video inherit select privileges;
create table Washer of Washer_T under HomeAppl inherit select privileges;
create table Dryer of Dryer_T under HomeAppl inherit select privileges;

Again, it is worth mentioning that this construction provides a considerable advantage when
dealing with changing dimensions. Consider again the product dimension. New articles
may be added, some articles may be re-classified, and other articles may be deleted
because they are no longer sold or their sales are no longer monitored. Since, however,
all articles are of the same type, we can simply instantiate a specific type multiple times.
Each resulting table then reflects a valid state of the classification hierarchy and can be
used to analyze the fact data under certain valid time perspectives.

After creating the dimensional hierarchies, we can instantiate the fact table from the
type Facts_T. Two aspects have to be considered explicitly. First of all, each fact gets a
system generated object identifier FactID. More important is the instantiation of the
references defined in the type specification. Each reference points to an appropriate
table, e.g. ArticleID references the Articles table and no longer the Articles_T type.

create table Facts of Facts_T (ref is FactID system generated,
                               ArticleID with options scope Articles,
                               ShopID with options scope Shops);

3.3 Data Manipulation

The object-relational design of a data warehouse database also has some impact on data
manipulation. Consider a simple insert statement for the product dimension. The only
difference to a 'pure relational' insert is the casting of the article identification to the
specific object identifier. Below is an example for a camcorder and a dryer article.

insert into Camcorder (ArticleID, brand, vidsys, blt, family, group, area) values
    (Camcorder_T('TR-75'), 'Sony', 'HI8', '2h', 'Camcorder', 'Video', 'ConsElectr');
insert into Dryer (ArticleID, brand, load, temp, family, group, area) values
    (Dryer_T('Duett'), 'Miele', '6kg', '37°C', 'Dryer', 'HomeAppl', 'WhiteGoods');

With the dimension entries in place, and thus the references consistent, we are now able
to insert facts into the fact table. Again, we have to cast the values of the references to
the corresponding type of the dimension tables, i.e. article identifiers are cast to
Articles_T and shop names are cast to Shops_T.

insert into Facts (FactID, ArticleID, ShopID, Period, Sales, Stock, Price) values
    (Facts_T('1'), Articles_T('TR-75'), Shops_T('TeVi'), '1999-12-02', 45, 22, 998);

When querying the database, we can take advantage of the references, which may be
regarded as predefined join paths. For example, grouping fact data by region (from the
Shops table) and product groups (from the Articles table) may be specified using the
'->' operator without any explicit join.

select f.ShopID->region, f.ArticleID->group, SUM(sales)
from Facts f
group by f.ShopID->region, f.ArticleID->group
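
For comparison, the same grouping can be written with explicit joins, which is what the references save us from spelling out. The sketch below uses SQLite through Python's `sqlite3` module, which has no '->' operator; the region value and two of the fact rows are made up for illustration, and `"group"` is quoted because it is a reserved word there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Articles (ArticleID text primary key, "group" text);
    create table Shops    (ShopID text primary key, region text);
    create table Facts    (ArticleID text references Articles,
                           ShopID text references Shops,
                           Sales integer);
    insert into Articles values ('TR-75', 'Video'), ('V-201', 'Video'),
                                ('Duett', 'HomeAppl');
    insert into Shops values ('TeVi', 'South');      -- region is made up
    insert into Facts values ('TR-75', 'TeVi', 45),  -- sales value from the text
                             ('V-201', 'TeVi', 10),  -- made-up rows
                             ('Duett', 'TeVi', 3);
""")

# Explicit-join equivalent of f.ShopID->region and f.ArticleID->group.
result = conn.execute("""
    select s.region, a."group", sum(f.Sales)
    from Facts f
    join Shops s    on f.ShopID = s.ShopID
    join Articles a on f.ArticleID = a.ArticleID
    group by s.region, a."group"
    order by a."group"
""").fetchall()
print(result)  # [('South', 'HomeAppl', 3), ('South', 'Video', 55)]
```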

However, we cannot retrieve node-specific attributes using this construction. The grouping
by video systems for all video equipment would be expressed with 'Video' as the
correct dimension table as follows:

select f.ShopID->region, a.vidsys, SUM(sales)
from Facts f, Video a
where f.ArticleID = a.ArticleID
group by f.ShopID->region, a.vidsys

In summary, object-relational technology provides a powerful tool for schema design in
data warehouse environments. In particular, the design of a real classification hierarchy
with classification and dimensional attributes finds an adequate representation using
inheritance mechanisms on types and typed tables. Extending the object-relational
representation to the fact table enables the designer to define some kinds of structural
constraints already at the schema level.

4 Summary and Conclusion 

This paper addresses the problem of locally valid dimensional attributes within a
classification hierarchy in the context of a multidimensional schema design. We show that
the traditional way of a relational representation is not feasible and give two practicable
solutions to this problem. The first proposed mechanism is based on the idea presented
in [SmSm77]. Building a pyramid of concepts in a bottom-up manner using
regular relational views may be seen as a seamless extension of the traditional
star/snowflake schema approach. The second, top-down approach is based on object-relational
techniques. We demonstrate how we can take advantage of object-relational concepts
like types, typed subtables, inheritance on types and tables, and references. Again,
this method results in a flexible schema design.

References 

ALTK97  Albrecht, J.; Lehner, W.; Teschke, M.; Kirsche, T.: Building a Real Data
        Warehouse for Market Research, to appear in: 8th International Conference and
        Workshop on Database and Expert System Applications (DEXA'97, Sept. 1-5,
        1997)

CoCS93  Codd, E.F.; Codd, S.B.; Salley, C.T.: Providing OLAP (On-line Analytical
        Processing) to User Analysts: An IT Mandate, White Paper, Arbor Software
        Corporation, 1993

GBLP96  Gray, J.; Bosworth, A.; Layman, A.; Pirahesh, H.: Data Cube: A Relational
        Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total, in: 12th
        IEEE International Conference on Data Engineering (ICDE'96, New Orleans,
        Louisiana, Feb. 26 - Mar. 1, 1996), pp. 152-159

GuMu95  Gupta, A.; Mumick, I.: Maintenance of Materialized Views: Problems, Techniques,
        and Applications, in: IEEE Data Engineering Bulletin, Special Issue on
        Materialized Views & Data Warehousing 18(1995)2, pp. 3-18

Hype99  N.N.: Hyperion Essbase OLAP Server, Hyperion Software Corporation, 1999
        (http://www.hyperion.com/essbaseolap.cfm)

Inmo96  Inmon, W.H.: Building the Data Warehouse, 2nd edition, New York, John Wiley &
        Sons, 1996

Kimb96  Kimball, R.: The Data Warehouse Toolkit, New York, John Wiley & Sons, 1996

Lore87  Lorenzen, P.: Constructive Philosophy, Amherst, Univ. of Massachusetts Press,
        1987

Micr95  N.N.: Microstrategy 6, MicroStrategy Inc., 1999
        (http://www.microstrategy.com/Products/index.asp)

Shos97  Shoshani, A.: OLAP and Statistical Databases: Similarities and Differences, in:
        Proceedings of the 16th Symposium on Principles of Database Systems (PODS'97,
        Tucson, Arizona, May 12-14, 1997), pp. 185-196

SmSm77  Smith, J.M.; Smith, D.C.P.: Database Abstractions: Aggregation and
        Generalization, ACM Transactions on Database Systems 2(1977)2, pp. 105-133

SQL99   N.N.: ISO/IEC 9075:1999, Information Technology - Database Languages - SQL,
        1999

VaSe99  Vassiliadis, P.; Sellis, T.: A Survey of Logical Models for OLAP Databases, in:
        SIGMOD Record 28(1999)4




Functional Dependencies in Controlling Sparsity of 
OLAP Cubes 



Tapio Niemi¹, Jyrki Nummenmaa¹, and Peter Thanisch²

¹ Department of Computer and Information Sciences, University of Tampere
FIN-33014 University of Tampere, Finland
{tapio, jyrki}@cs.uta.fi
² Institute for Computing System Architecture,
University of Edinburgh, Edinburgh EH9 3JZ, Scotland
pt@dcs.ed.ac.uk



Abstract. We will study how relational dependency information can be applied 
to OLAP cube design. We use dependency information to control sparsity, 
since functional dependencies between dimensions clearly increase sparsity. 
Our method helps the user in finding dimensions and hierarchies, identifying 
sparsity risks, and finally changing the design in order to get a more suitable 
result. Sparse raw data, a large amount of pre-calculated aggregations, and 
many dimensions may expand the need of the storage space so rapidly that the 
problem cannot be solved by increasing the capacity of the system. We give 
two methods to construct suitable OLAP cubes. In the synthesis method, 
attributes are divided into equivalence classes according to dependencies in 
which they participate. Each equivalence class may form a dimension. The 
decomposition method is applied when candidates for dimensions exist. We 
decompose dimensions based on conflicts, and construct new cubes for 
removed dimensions until no conflicts between dimensions exist. 



1. Introduction 

On-Line Analytical Processing (OLAP) has gained popularity as a method to support
decision making in situations where large amounts of raw data should be analysed. In
OLAP, queries are made against structures called OLAP cubes. The design of a cube
is based on knowledge of the application area and the types of queries the users are
expected to pose.

In practice, the users may want to speculate about the effects of, e.g. changing the 
way their company has arranged its organisation. Also, there may be needs to analyse 
data in ways, which were not anticipated at the time when the OLAP cubes were 
designed. It is common practice for some OLAP users to frequently reorganise their 
OLAP cubes even though the nature of the underlying raw data has not changed. 

Although there are various ways in which the users may want to arrange the OLAP cube,
the resulting designs differ significantly in terms of efficiency and practicality.
For example, a cube can be extremely sparse, in the sense that many of the data items
are missing or zero because of the nature of the data, while some other designs can be
problematic for query formulation (incorrect aggregations).

However, although some research has been done on the design goals for OLAP
cubes, this research is far from complete and many important
design issues still require further attention. To date, the research has concentrated on
physical design such as indexing and clustering. There is also some work on the
conceptual level description of OLAP applications using e.g. ER diagrams [2].

We divide OLAP cube design into three phases: conceptual modelling, logical 
design, and physical design. The borders of these are not very clear, and they are 
different from traditional database design. 

By conceptual modelling we mean describing and analysing the concepts and their
relationships. The result of conceptual modelling is a conceptual schema, which
explicitly presents concepts and their relationships. The starting point of conceptual
modelling varies. Often we have a pre-existing data warehouse.
Other possibilities are that we only have an operational database ready, or that we
start by designing the cube first and then design and implement the data warehouse
for that cube. Logical design starts from the conceptual schema and its result is
an OLAP schema, i.e. cubes with dimensions and their hierarchies. Unlike in traditional
database design, the design of a logical database schema, e.g. a relational database
schema [3], is not included in logical OLAP design but in physical design. The aim of
physical design is to find an efficient implementation for the desired data cube. The
result can be a relational star (snowflake) schema or a multidimensional schema.
Physical design also includes optimisation of the chosen storage structure, for
example building indexes.

This work is organised as follows. In Section 2 we study logical design and its
aims in more detail. Section 3 is a review of related work. In Section 4, we study
different properties of dependency sets and their effects on sparsity. At the end of the
section we give two different approaches to constructing OLAP schemas that avoid
sparsity. Finally, the conclusions are presented in Section 5.



2. Logical Design 

The aim of logical design is to produce a good schema for an OLAP cube or a set of 
cubes. The high level aims are to produce a cube in which queries can be constructed 
easily, answered correctly, and evaluated efficiently. The first aim means that the 
system should be easy to use, and the user should be able to construct complex 
queries easily. By correctness we mean that the user gets a correct and relevant 
answer to the query. Efficiency means that the query can be evaluated quickly using 
minimal amount of resources. 

The starting point of logical design is a formal conceptual schema of a data
warehouse. The conceptual schema is supposed to be complete, i.e. it describes all
information stored in the data warehouse. The conceptual schema contains concepts
and relationships between them. In logical design, we operate with concepts and
relationships in order to get a schema that represents the logical model of an OLAP
cube. We can derive new concepts and relationships according to particular rules, and
also remove concepts and relationships if the rules allow it. A basic type of such rules
is formed by the inference rules of relational dependency theory [3, 4].

The result of logical design can be evaluated by comparing it to some normal
forms. The normal forms will be something like relational normal forms: a particular
normal form has some desirable properties related to the correctness of aggregation,
sparsity, etc. On the logical level our high level aims are completeness, summarizability,
and controlling sparsity.

Controlling Sparsity. Sparsity means that an OLAP cube contains empty cells. A 
simple measure for sparsity is the ratio of empty cells compared to all cells. The ideal 
case is that sparsity is zero but this is seldom possible to achieve in real applications. 
Pendse has noticed that sparsity can increase more than exponentially in situations 
containing many pre-calculated aggregation values [12]. This kind of database 
explosion cannot be avoided using efficient storage methods for sparse data, because 
the real amount of data increases due to pre-calculated aggregation information. The 
only solution is to ensure that the cube, which stores the raw data, is not sparse. 
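
The sparsity measure defined above is easy to compute once the dimension sizes and the number of filled cells are known. The following sketch is an illustration only; the function name `sparsity` and the example numbers are assumptions, not values from the paper.

```python
from math import prod


def sparsity(dimension_sizes, non_empty_cells):
    """Ratio of empty cells to all cells of a cube."""
    total = prod(dimension_sizes)
    if total == 0:
        raise ValueError("the cube must contain at least one cell")
    if not 0 <= non_empty_cells <= total:
        raise ValueError("non_empty_cells must lie between 0 and the cube size")
    return (total - non_empty_cells) / total


# Hypothetical cube: 4 x 5 x 10 = 200 cells, 50 of them filled.
print(sparsity([4, 5, 10], 50))  # 0.75
```

A value of 0 corresponds to the ideal, fully populated cube, and values close to 1 indicate that almost all cells are empty.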

Completeness. By completeness, Cabibbo and Torlone [1] mean that all information
stored in operational databases can be accessed using the OLAP cube.
Completeness in this sense is related to data warehouse design rather than logical
OLAP design. In OLAP, we can define completeness in a different
way: it should be possible to access any information stored in the data warehouse
using an OLAP cube or cubes.

Summarizability. Lenz and Shoshani have studied summarizability in OLAP [6]. 
They give three necessary conditions for summarizability, and they also assume that 
these conditions are sufficient. The conditions are: 

1. disjointness of category attributes, 

2. completeness, and 

3. correct use of measure (summary) attributes with statistical functions. 

Disjointness requires that attributes in dimensions form disjoint subsets over the 
elements. Completeness means that all elements exist and every element is assigned 
to some category in the level above it in the hierarchy. Correct use of measure 
attributes with statistical functions is the most complex of these requirements, since it 
depends on the type of the attribute and the type of the statistical function. Measure 
attributes can be classified into three different types, which are flow, stock, and 
value-per-unit. Knowing the type of an attribute we can conclude which statistical 
functions can be applied to the attribute. 
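
The first two conditions can be checked mechanically for a pair of adjacent hierarchy levels. The sketch below is a simplified reading of disjointness and completeness, assuming the roll-up is given as a list of (child, parent) pairs; the function `check_rollup` and the month/quarter data are hypothetical.

```python
from collections import defaultdict


def check_rollup(elements, pairs):
    """Check a child-to-parent mapping between two hierarchy levels.

    elements: all members of the lower level
    pairs:    (child, parent) assignments
    Returns (disjoint, complete).
    """
    parents = defaultdict(set)
    for child, parent in pairs:
        parents[child].add(parent)
    # Disjoint: no element is assigned to more than one category.
    disjoint = all(len(p) <= 1 for p in parents.values())
    # Complete: every element is assigned to some category.
    complete = all(e in parents for e in elements)
    return disjoint, complete


months = ["Jan", "Feb", "Mar", "Apr"]
good = [("Jan", "Q1"), ("Feb", "Q1"), ("Mar", "Q1"), ("Apr", "Q2")]
bad = [("Jan", "Q1"), ("Feb", "Q1"), ("Feb", "Q2"), ("Mar", "Q1")]

print(check_rollup(months, good))  # (True, True)
print(check_rollup(months, bad))   # (False, False)
```

In the second mapping, Feb belongs to two quarters and Apr to none, so sums over quarters would count Feb twice and lose Apr. The third condition depends on the measure type and the statistical function and is not covered by this check.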



3. Approaches to Logical OLAP Design

Logical cube design is not a widely studied area. Most existing work is
related to physical design, i.e. it studies methods to implement a cube efficiently. In
this section we review some papers related to logical OLAP design.

Pedersen and Jensen [11] discuss requirements for multidimensional data models. 
Their nine requirements are: 

1) explicit hierarchies in dimensions, 

2) symmetric treatment of dimensions and measures, 

3) support for multiple hierarchies in a dimension, 

4) correct aggregation of data, 

5) support for non-strict hierarchies, 

6) handling of many-to-many relationships between facts and dimensions, 

7) change of data over time should be supported, 

8) uncertainty could be associated to data, and 

9) data with different levels of granularity should be allowed. 

Explicit hierarchies in dimensions means that hierarchies should be explicitly
shown in the schema to help the user understand the hierarchical structures in
dimensions. Symmetric treatment of dimensions and measures means that it should be
possible to use a measure as a dimension and vice versa. It can sometimes be useful
to group items according to an attribute usually used as a measure, e.g. to group
products according to the profit they give. By multiple hierarchies in a dimension the
authors mean that a dimension hierarchy is allowed to be an acyclic graph, that is,
there can be several paths from the top level to the most detailed level. For example,
the time dimension can have hierarchies like day-week and day-month-year.

The fourth item, correct aggregation of data, ensures that aggregated values are not 
misleading. The idea of the fifth item, support for non-strict hierarchies, is to allow 
many to many relationships between different levels in a dimension. Many to many 
relationships between facts and dimensions means that there can be many measure 
values per one combination of dimension values, i.e. the measure should be a set of 
values instead of one value. The situation can often be avoided by increasing 
dimensions but this can sometimes be unnatural. Time and changes are important 
aspects in OLAP, since OLAP is often used in analysing historic data. Time itself is 
usually treated as a dimension, but changes are more difficult, for example hierarchies 
can change in time. The eighth item means that the system should be able to handle 
uncertain or fuzzy information. Finally, the last item, data with different levels of 
granularity should be allowed, means that it should be possible to store data in 
different levels of hierarchy, not only to the lowest level. However, this easily causes 
problems in aggregations. 

The authors propose a model providing all nine properties. The model is an 
algebraic one, which may reduce its usability, but it has a large expressive power. 

Li and Wang [8] present a model for multidimensional databases. A database can 
contain several data cubes. They also define an extension for relational algebra, called 
grouping algebra, which can be used in OLAP queries. 

In some works the ER model [2] is applied to the cube design. Sapia et al. [13] 
extend the ER model by adding three primitives into it in order to allow expressing 
the multidimensional structure of data; a fact relationship set, a dimension level set, 
and a classification set. However, the method does not offer much help for the design 
process but the designer has to know what kind of structure is suitable for the current 
application. Further, the model does not have features to help to achieve desirable 
properties of the OLAP cube (summarizability, efficiency). 

Golfarelli et al. [5] present a semi-automatic modelling method based on an
existing ER schema. The method is called the Dimensional Fact model. It can
automatically find possible dimension hierarchies and allows the user to modify them
by adding or removing levels. Summarizability is also taken into account; measure
attributes are classified as additive, semi-additive, or non-additive. Measures are
supposed to be atomic. If many measures are needed, several cubes can be constructed.
Two cubes are defined to be compatible if they have at least one common dimension. To
ensure completeness in aggregations, a many-to-one mapping has to hold between levels in
a dimension. The algorithm for building dimensions may help to get orthogonal
dimensions, but this is not studied in the paper. However, some relationships may be
lost while constructing dimensions, and they may cause problems with orthogonality.

A method by Cabibbo and Torlone [1] is based on a multidimensional data model,
called MD, whose main parts are a dimension and a fact table. The method has some
similarities to the method of Golfarelli et al. They both start from the existing ER
schema of an operational database. The method by Cabibbo and Torlone has four steps:

1. identification of measures and dimensions, 

2. restructuring the pre-existing ER schema, 

3. derivation of the dimensional graph, and 

4. translation into the MD model. 

The first two steps can be parallel and they are often iterated. The dimensional 
graph is used to represent measures and dimensions. It can be constructed from the 
restructured ER schema, and the dimensions can be easily derived from the 
dimensional graph. The MD model does not explicitly offer help in designing OLAP 
cubes with good properties for aggregation correctness and controlling sparsity. 

Lehner et al. [7] present normal forms for multidimensional databases. The normal 
forms are designed to help in ensuring summarizability and controlling sparsity. The 
normal form requires that there exists a functional dependency between the 
hierarchical levels in a dimension, but no functional dependency is allowed between 
different dimensions. The normal form presented helps to avoid problems with
incorrect aggregation and sparsity, but the normal form itself may sometimes be
impossible to achieve in practice, because totally independent dimensions cannot
be found in many applications. The orthogonality of dimensions is
studied only with respect to (unary) functional dependencies, and summarizability is 
not understood in as large a sense as in the paper by Lenz and Shoshani [6]. 

Wang and Li [14] discuss the use of orthogonality in optimising search for 
summary data. Although the aim of the paper is to improve efficiency of query 
processing, it discusses some topics related to logical design, such as the relationship 
between multivalued dependencies and orthogonality: if there is a multivalued 
dependency between dimensions, the dimensions are orthogonal. 



4. Applying Dependency Information in OLAP Design 

In this section we show how functional dependencies [4] can be applied to logical 
OLAP design. We assume that the reader is familiar with some dependency theory. 



4.1 Properties of Dependency Sets 

In this subsection we study dependency sets and their properties in the spirit of
Nummenmaa [10]. Nummenmaa has noticed that certain types of dependency sets lead
to bad relational database schemas. In what follows we show that these results can
also be applied in OLAP design. We study how certain properties of a dependency set
are reflected in the properties of the resulting OLAP cube. We concentrate only on functional
dependencies (FDs). The set of FDs is denoted by F, and the set of left hand sides of F
by LHS(F). A dependency set is called unary if it contains only FDs with a
singleton left hand side. Unary dependency sets are easier to handle than non-unary
sets because of simpler inference rules, but their expressive power is smaller.

We will need the following definitions. If $X \to Y$ and there is no $X' \subset X$ such that 
$X' \to Y$, then $X \to Y$ is a full dependency, and we write $X \rightarrowtail Y$. We say that $X^{+}$ is the 
closure of $X$, i.e. $X^{+}$ contains all attributes that $X$ determines directly or transitively. 
We write $X^{\rightarrowtail} = \{A \mid X \rightarrowtail A\}$, $X^{-} = X^{+} - X$, and $X^{*} = X^{\rightarrowtail} - X$. 

RHS-intersection free. A dependency set is RHS-intersection free if, for any 
$X, Y \in LHS(F)$, whenever $X^{*} \cap Y^{*} \neq \emptyset$, then either $X \to Y$ or $Y \to X$. [10] 

An RHS-intersecting dependency set can still be unary; thus the RHS-intersection free 
property is a natural extra demand for unary dependency sets. 

Split free. We say that $X$ splits $Y$ if $X^{\rightarrowtail} \cap Y \neq \emptyset$ and not $X \to Y$. A set of FDs is split 
free if no $X \in LHS(F)$ splits any $Y \in LHS(F)$. [10] 

Every unary dependency set is also split free. Splitting often exists together with 
the RHS-intersecting property. 
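
The two properties can be checked mechanically. The following Python sketch is only an illustration and is not taken from the paper: the function names are hypothetical, FDs are modelled as pairs of frozensets, and the right-hand-side set of X is assumed to be the set of attributes fully determined by X, minus X itself. The two test inputs correspond to the dependency sets of Figures 2 and 4 discussed in Section 4.2, with the measure's own dependency left out.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attributes determined by attrs, directly or transitively."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def fully_determined(x, fds):
    """Attributes determined by x but by no proper subset of x."""
    x = frozenset(x)
    return frozenset(a for a in closure(x, fds)
                     if all(a not in closure(x - {b}, fds) for b in x))


def determines(x, y, fds):
    return frozenset(y) <= closure(x, fds)


def left_hand_sides(fds):
    return sorted({lhs for lhs, _ in fds}, key=sorted)


def rhs_intersecting_pairs(fds):
    """Pairs of left hand sides that violate RHS-intersection freeness."""
    return [(x, y) for x, y in combinations(left_hand_sides(fds), 2)
            if (fully_determined(x, fds) - x) & (fully_determined(y, fds) - y)
            and not determines(x, y, fds) and not determines(y, x, fds)]


def splitting_pairs(fds):
    """Pairs (x, y) of left hand sides such that x splits y."""
    lhss = left_hand_sides(fds)
    return [(x, y) for x in lhss for y in lhss
            if x != y and fully_determined(x, fds) & y
            and not determines(x, y, fds)]


def fd(lhs, rhs):
    """Helper: fd('date person', 'car') stands for date person -> car."""
    return (frozenset(lhs.split()), frozenset(rhs.split()))


fig2 = [fd("customer", "office"), fd("employee", "office")]
fig4 = [fd("date person", "car"), fd("car", "person")]

print(rhs_intersecting_pairs(fig2))  # customer and employee clash on office
print(splitting_pairs(fig4))         # car splits {date, person}
```

Under these assumptions the Figure 2 set is reported as RHS-intersecting but split free, while in the Figure 4 set car splits {date, person}.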



4.2 Estimating Sparsity 

By minimum sparsity we mean the theoretical minimum amount of sparsity in a cube. 
We usually have some design constraints, like FDs, for logical cube design. As an 
example, we have a cube with three dimensions: employee, customer and product. 
The measure is the amount of sold products. With no FDs (or other constraints), the 
minimum sparsity is zero. However, if we know that an employee sells only one 
product, we get the FD employee → product. Now the cube is necessarily sparse, 
because most product-employee combinations will have an empty value. If we 
have 10 employees, 20 customers and 5 products, then the size of the cube is 1000 
cells. We can slice the cube according to employees and get 10 sub-cubes. Every 
sub-cube can have at most 20 non-empty cells, since an employee can sell only one 
product but possibly to all customers. Hence the number of non-empty cells is at most 
10 × 20 = 200, while the total number of cells is 1000; thus the sparsity in the cube 
cannot be less than 80%. Next we study how different properties of dependency sets 
affect sparsity. 
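
The 80% figure can be confirmed by brute force. The sketch below enumerates all cells of the 10 × 20 × 5 cube and counts those that can be non-empty; the particular assignment of products to employees is an arbitrary assumption, since any assignment satisfying employee → product gives the same count.

```python
from itertools import product as cartesian

employees = range(10)
customers = range(20)
products = range(5)

# Hypothetical assignment satisfying employee -> product:
# every employee sells exactly one product.
sells = {e: e % len(products) for e in employees}

cells = list(cartesian(employees, customers, products))

# A cell can hold a value only if its product is the one its employee sells.
fillable = [(e, c, p) for e, c, p in cells if sells[e] == p]

min_sparsity = 1 - len(fillable) / len(cells)
print(len(cells), len(fillable), min_sparsity)
```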

Unary Dependency Sets. Let C(U) be a data cube schema, U a non-empty set of 
attributes forming the dimensions and D a set of unary FDs. Let $X = U$, and for every 
FD $A \to B \in D$ such that $B \in X$, remove $B$ from $X$. Let $c$ be an attribute. The notation $|c|$ 
means the number of distinct values of $c$. The minimum sparsity of the cube is now 

$$1 - \frac{\prod_{x \in X} |x|}{\prod_{c \in U} |c|}$$

If the dependencies are $A \to B$ and $B \to A$ (a key dependency), either A or B can be 
removed, but not both of them. A key dependency is worse for sparsity than a normal FD. It 
leads to even sparser cubes, as values are spread diagonally (Figure 1). 
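
A minimal Python sketch of this calculation follows. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, FDs are pairs of attribute names, and for mutually dependent attributes (key dependencies) the alphabetically smallest one is the one kept. The SSN and employee-number cardinalities are made up.

```python
from math import prod


def unary_closure(attr, fds):
    """Attributes reachable from attr through unary FDs (attr included)."""
    seen, stack = {attr}, [attr]
    while stack:
        a = stack.pop()
        for lhs, rhs in fds:
            if lhs == a and rhs not in seen:
                seen.add(rhs)
                stack.append(rhs)
    return seen


def remaining_dimensions(cardinality, fds):
    """The set X: attributes not determined by another attribute.

    For a key dependency (A -> B and B -> A) exactly one attribute is kept.
    """
    clo = {a: unary_closure(a, fds) for a in cardinality}
    return [b for b in cardinality
            if all(a in clo[b] and b < a
                   for a in cardinality if a != b and b in clo[a])]


def min_sparsity(cardinality, fds):
    kept = remaining_dimensions(cardinality, fds)
    dense = prod(cardinality[x] for x in kept)
    return 1 - dense / prod(cardinality.values())


sales = {"employee": 10, "customer": 20, "product": 5}
print(min_sparsity(sales, [("employee", "product")]))

keys = {"ssn": 4, "empl_num": 4}  # hypothetical cardinalities
print(min_sparsity(keys, [("ssn", "empl_num"), ("empl_num", "ssn")]))
```

With the employee, customer and product cardinalities used earlier, the function reproduces the 80% lower bound.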

Fig. 1. Two cubes with different dependencies: 
i) SSN → empl. num and empl. num → SSN, ii) SSN → name 



RHS-intersecting Dependency Sets. If a dependency set is not RHS-intersection 
free, it is not possible to construct a single non-sparse OLAP cube because of the 
RHS-intersection(s) between some dimensions. The following example illustrates 
this. We use a schema drawing technique simplified from the one presented by Niemi 
et al. [9]. Text boxes represent attributes, and arrows represent functional dependencies. 
A two-headed arrow means a symmetric functional dependency, i.e. a key dependency. 




Fig. 2. Employee and customer use only one office 

In the schema above, an employee and a customer each do business in only one office, 
which is always the same for both of them. If the number of product items sold is the 
measure, the possible dimensions are product, employee, customer, and office. Further, 
the dependencies are product_sold → customer employee product office, 
customer → office, and employee → office. A natural choice for dimensions is product, 
employee, and customer. A problem in the dependency set is that both customer and 
employee determine office but there is neither a dependency customer → employee nor 
employee → customer, i.e. the dependency set is RHS-intersecting. From this it 
follows that only those employee-customer pairs that use the same office can have a 
non-empty measure. The situation remains the same if we also choose office as a 
dimension. In this case, the minimum sparsity can be calculated using the formula 

$$1 - \frac{|\text{customer}| \, |\text{employee}| \, |\text{product}|}{|\text{customer}| \, |\text{employee}| \, |\text{product}| \, |\text{office}|}$$
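
The effect of the shared office can be illustrated with a toy instance. All names and office assignments below are made up; the product dimension is omitted because it multiplies both the filled and the total cell counts by the same factor.

```python
from itertools import product as cartesian

# Hypothetical office assignments (employee -> office, customer -> office).
employee_office = {"e1": "A", "e2": "A", "e3": "B"}
customer_office = {"c1": "A", "c2": "B", "c3": "B", "c4": "B"}
offices = sorted(set(employee_office.values()) | set(customer_office.values()))

# Only pairs that share an office can ever have a non-empty measure.
feasible_pairs = [(e, c)
                  for e, c in cartesian(employee_office, customer_office)
                  if employee_office[e] == customer_office[c]]

pairs_total = len(employee_office) * len(customer_office)
sparsity_without_office = 1 - len(feasible_pairs) / pairs_total
sparsity_with_office = 1 - len(feasible_pairs) / (pairs_total * len(offices))

# Lower bound given by the formula: 1 - 1/|office|.
formula_bound = 1 - 1 / len(offices)

print(len(feasible_pairs), sparsity_without_office,
      sparsity_with_office, formula_bound)
```

In this instance only 5 of the 12 employee-customer pairs are feasible, so the cube is sparse with or without office as a dimension, and the sparsity with office never falls below the bound given by the formula.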

Non-unary FDs. In this subsection we study non-unary dependency sets and their 
properties. 




Fig. 3. date person → car 

Fig. 4. date person → car, car → person 



In Figure 3, the dependencies date person → car and repair → car make the dependency 
set RHS-intersecting. For this reason the cube will be sparse if we choose date, 
person, and car as dimensions, because only one car can be repaired per 
date-person combination. We can extend the calculation rule of minimum sparsity to 
non-unary FDs simply by allowing the left hand side to contain attributes from several 
dimensions. In Figure 3, the minimum sparsity will be 
$1 - \frac{|\text{date}| \, |\text{person}|}{|\text{date}| \, |\text{person}| \, |\text{car}|}$. We could get a better cube if only date and person were the 
dimensions, but then it would be difficult to analyse the repair history of a 
particular car. To avoid the problem we can construct multiple cubes with 
dimensions car, date; car, person; and date, person. 

In Figure 4 we have a situation where car splits the set {date, person}, since 
$\{\text{car}\}^{\rightarrowtail} \cap \{\text{date}, \text{person}\} \neq \emptyset$ (car determines person) but car does not determine 
{date, person}. In practice this means that the same person always brings the car to the 
garage, and a person can bring only one car a day to the garage. If car, date, and person 
are the dimensions, the sparsity will be 
$1 - \frac{|\text{date}| \, |\text{person}|}{|\text{date}| \, |\text{person}| \, |\text{car}|}$, which is the same as without 
the car → person dependency. However, the car → person dependency slightly increases 
the sparsity by demanding that there be at least as many cars as persons. 

Finally, the last case is that the left hand sides of dependencies intersect. An 
example of this is seen in Figure 5. However, the example schema is also 
RHS-intersecting. The LHS-intersection is caused by the date attribute being a part of 
the left hand side of both the owner date → discount and the person date → car dependencies. 
LHS-intersection does not seem to be very problematic for OLAP cubes, because 
these situations are usually RHS-intersecting, too. Therefore we do not study 
LHS-intersection in more detail in this work. 



[Figure 5: schema diagram with the attributes discount, repair, car, owner, date, and person] 

Fig. 5. date person → car, date owner → discount 



Normal Forms. Based on the properties of the dependency sets studied above, we 
can define normal forms to avoid those problematic situations. Relational normal 
forms, except the first one, are hardly suitable for OLAP cubes, because operational 
and OLAP databases have different aims. In general, the aims of OLAP normal forms 
are to ensure both a minimal amount of sparsity and correctness of aggregations, but we 
concentrate mostly on controlling sparsity in this work. OLAP normal forms can be 
classified into three categories. Multidimensional normal forms are related to 
dependencies between dimensions, dimensional normal forms concern 
intra-dimensional dependencies, and the first normal form concerns dependencies 
between the dimension set and the measure attribute. 

Following Lehner et al. [7], a data cube is in: 

■ dimensional normal form if there is a FD from each level to the next level in a 
hierarchy. 

■ multidimensional normal form if it is in dimensional normal form and there are no 
FDs between dimensions. 

The aim of the dimensional normal form is to guarantee completeness in 
aggregations, while the multidimensional normal form is mostly used in controlling 
the sparsity of OLAP cubes. 

We define two new normal forms for multidimensional OLAP cubes. The normal 
forms are based on the properties of functional dependency sets. We say that an 
attribute participates in a dependency if it appears on the left or the right hand side of 
the dependency. A data cube is in: 

■ RHS intersection normal form if it is in multidimensional normal form and there 
are no RHS-intersecting dependencies between dimensions. 

■ Split free normal form if it is in RHS intersection normal form and there is no 
$X \in LHS(F)$ that splits any $Y \in LHS(F)$ belonging to a different dimension. 



4.3 Producing Normalised OLAP Schemas 

We can recognise two different situations in logical OLAP design. In the first one we 
know the dimensions, and want to decompose them based on FDs. In the second one 
we only have a conceptual schema with the dependency information and we want to 
build the dimensions. The first case is analogous to relational decomposition and the 
second one to synthesis. In decomposition we decompose to reduce conflicts, and in 
synthesis we combine compatible attributes into the same dimension. Typically, 
decomposition results in too many dimensions (loss of information and a need to 
combine data from several dimensions) and synthesis in too few (sparsity). 

Synthesis. We construct equivalence classes of attributes based on the functional 
dependencies in which they participate. The measure attribute cannot be taken into 
account, because it usually functionally determines all other attributes. Let U be a set of 
attributes, M the measure concept (the concept some of whose attributes are the actual 
values to measure), $S = U - \{M\}$, and F the set of functional dependencies among the 
attributes in S. We form an undirected dependency graph for S as follows. All attributes 
of S are nodes. Let $A, B \in S$. There is an edge between A and B if and only if there is 
an FD $X \to Y \in F$ such that $A \in X$ and $B \in Y$, or $A \in Y$ and $B \in X$. Further, A and B 
belong to the same equivalence class if there is a path in the dependency graph 
between A and B. 

An equivalence class can form a dimension if it is in dimensional normal form [7]. 
An equivalence class cannot be split across different dimensions, because in that 
case we would necessarily have a dependency between dimensions. A dimension 
cannot contain attributes from several equivalence classes, because according to the 
dimensional normal form there has to be a functional dependency between the 
attributes in the same dimension. 

Example 1. Figure 6 illustrates the algorithm. We want to measure the profit of sold 
products, i.e. product_sold is the concept to measure and profit the actual value to 
measure. We can also notice that we have an FD from product_lot to employee 
(for example, an employee sells the products (s)he buys); thus employee cannot be 
an individual dimension, but it can be used to classify products, i.e. we can group 
products according to the employee who bought them. 




Fig. 6. OLAP schema of a company. Profit is the measure. Potential dimensions are time, 
customer, and product lot. Employee cannot be a dimension 

The set S will be {time, employee, empl_id, e_name, customer, cus_id, c_name, 
product_lot, prod_num, own_price, p_name, prod_group}. The set 
D = {employee → empl_id e_name, customer → cus_id c_name, product_lot → employee 
prod_num own_price p_name, p_name → prod_group, cus_id → customer, 
empl_id → employee, prod_num → product_lot}. We get three equivalence classes: {time}, 
{employee, empl_id, e_name, product_lot, prod_num, own_price, p_name, 
prod_group}, and {customer, cus_id, c_name}. All equivalence classes are in 
dimensional normal form, so they can form three dimensions. 
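
The synthesis step amounts to finding the connected components of the dependency graph. The following Python sketch uses a small union-find structure and is only one possible realisation; the helper names are hypothetical, the attribute spellings are normalised (cus_id, prod_num), and the last dependency is read as prod_num → product_lot, which is an assumption.

```python
def equivalence_classes(attributes, fds):
    """Connected components of the undirected dependency graph."""
    parent = {a: a for a in attributes}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    # An edge joins every left-hand-side attribute of an FD
    # with every right-hand-side attribute of the same FD.
    for lhs, rhs in fds:
        for a in lhs:
            for b in rhs:
                parent[find(a)] = find(b)

    classes = {}
    for a in attributes:
        classes.setdefault(find(a), set()).add(a)
    return {frozenset(c) for c in classes.values()}


def fd(text):
    """Helper: fd('a b -> c') gives (['a', 'b'], ['c'])."""
    lhs, rhs = text.split("->")
    return (lhs.split(), rhs.split())


S = ("time employee empl_id e_name customer cus_id c_name "
     "product_lot prod_num own_price p_name prod_group").split()

D = [
    fd("employee -> empl_id e_name"),
    fd("customer -> cus_id c_name"),
    fd("product_lot -> employee prod_num own_price p_name"),
    fd("p_name -> prod_group"),
    fd("cus_id -> customer"),
    fd("empl_id -> employee"),
    fd("prod_num -> product_lot"),  # assumed reading of the last FD
]

for cls in sorted(equivalence_classes(S, D), key=len):
    print(sorted(cls))
```

Run on the sets S and D of Example 1, the sketch yields the same three equivalence classes.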

Decomposition. We can use a decomposition technique when initial dimensions are 
given. In decomposition we construct multiple cubes to avoid dependencies between 
dimensions. We can divide a cube using dimensions or values of a dimension. 

We assume that dimension candidates are in dimensional normal form. In the first 
case conflicting dimensions are removed until the desired normal form is achieved. A 
new cube is constructed from the removed dimensions and then the process is 
repeated for it. The resulting OLAP schema contains multiple cubes having zero as 
the minimum sparsity. Example 2 illustrates this. 

The second possibility is to divide a cube according to some grouping attribute. For 
example, for the schema presented in Figure 2, we can construct a separate cube for 
every office. However, this usually violates the completeness condition, since 
completeness can be understood to mean that if we have some attribute as a dimension, 
then it has to contain all values of that attribute. The two methods can also be combined. 



Example 2. Normalising the schema in Figure 2 gives the tables in Figure 7. 






[Figure 7: schema diagram; recoverable labels are product (three times), employee, customer, and office] 

Fig. 7. Normalised OLAP schema 



The minimum sparsity of the OLAP schema in Figure 7 is zero. We cannot form a 
non-sparse cube with employee and customer as dimensions (not every employee can 
do business with every customer if there are customers using different offices). The 
normalisation is not lossless; we lose the connection between employees and customers. 
However, if we allow some sparsity and also construct a cube with employee and 
customer as dimensions, no information will be lost and the total sparsity is still much 
less than in the case of one four-dimensional cube. 



5. Conclusions 

We have studied functional dependencies in the context of logical OLAP cube design. 
An OLAP cube should be suitable in two respects: sparsity and correctness of 
aggregations. Using functional dependency information we can achieve better results 
for both aims. 

An OLAP schema should not contain any functional dependencies between different 
dimensions, but there should be a functional dependency from each level to the higher 
level in a hierarchy [7]. Moreover, we have noticed that there should be no 
situations in which attributes belonging to different dimensions determine one and 
the same attribute, or in which attributes from different dimensions together determine 
another attribute (non-unary FDs). Independence of dimensions guarantees the minimal 
amount of sparsity, while functional dependencies between hierarchy levels ensure 
completeness in aggregations. Furthermore, for correct aggregations the types and 
semantics of the attributes also have to be known, in order to determine which 
statistical functions can be applied. 



Abstract. Collecting and mining web log records (WLRs) from e-commerce 
web sites has become increasingly important for targeted marketing, promotions, 
and traffic analysis. In this paper, we describe a scalable data warehousing 
and OLAP-based engine for analyzing WLRs. We have to address 
several scalability and performance challenges in developing such a framework. 
Because an active web site may generate hundreds of millions of WLRs 
daily, we have to deal with huge data volumes and data flow rates. To support 
fine-grained analysis, e.g., individual users’ access profiles, we end up with 
huge, sparse data cubes defined over very large-sized dimensions (there may 
be hundreds of thousands of visitors to the site and tens of thousands of 
pages). While OLAP servers store sparse cubes quite efficiently, rolling up a 
very large cube can take prohibitively long. We have applied several non-traditional 
approaches to deal with this problem, which allow us to speed up 
WLR analysis by 3 orders of magnitude. Our framework supports multilevel 
and multidimensional pattern extraction, analysis and feature ranking, and in 
addition to the typical OLAP operations, supports data mining operations such 
as extended multilevel and multidimensional association rules. 



1 Introduction 

Commercial web sites typically generate large volumes of web log records (WLRs) 
daily. These WLRs can be collected and mined to extract customer behavior patterns, 
which may then be used for a variety of business purposes such as making product 
recommendations, designing marketing campaigns, or redesigning the web site. 
Numerous commercial tools (e.g., WebTrends [13] and Net.Genesis [12]) are available 
for analyzing WLRs (and other data sources) and generating reports for business 
managers [3,7,11]. However, these tools typically provide only a fixed set of 
preconfigured reports, have limited on-line analytical capabilities, and do not support 
more sophisticated data mining operations such as customer profiling or association 
rules. 

On-line analytical processing (OLAP) tools are designed to support complex, 
multi-dimensional and multi-level on-line analysis of large volumes of data stored in 
data warehouses [1,2-4,8,10]. In our prior work, we have described a scalable 
framework developed on top of an Oracle-8 based data warehouse and a commercially 
available multi-dimensional OLAP server, Oracle Express, which we have 
used to develop applications for analyzing customer calling patterns from telecom 
networks and shopping transactions from e-commerce sites [6]. In this paper, we 
describe a Web access analysis engine implemented on this framework to support the 
collection and mining of WLRs at the high data volumes typical of large commercial 
Web sites. In developing this engine, we encountered additional performance and 
functionality problems, which we address in this paper. 

Typically, hundreds of millions of WLRs are created daily, and the architecture 
must support loading and processing rates that match the input rate. Further, since 
WLRs are continuously collected, it is important to mine them in “real-time” to 
dynamically detect trends and changes in traffic patterns, so as to dynamically make 
business decisions (e.g., recommending products, targeting advertisements, offering 
promotions). The results of the analysis are represented by summary information at 
multiple levels of granularity, e.g., hourly, daily, weekly, and monthly, as well as 
along multiple dimensions. These summaries must be stored and incrementally 
updated. 

While a data warehouse/OLAP framework is capable of dealing with huge data 
volumes, it does not guarantee that the summarization and analysis operations can 
scale to keep up with the input data rates. Specifically, for Web access analysis, we 
want to introduce a number of fine-grained dimensions, resulting in very large, very 
sparse data cubes, which pose serious scalability and performance challenges to data 
aggregation and analysis, and more fundamentally, to the use of OLAP for such 
applications. For example, in one application, a newspaper Web site received 1.5 
million hits a week against pages that contained articles on various subjects. We 
wanted to profile the behavior of visitors from each originating site at different times 
of the day, including their interest in particular subjects and which referring sites 
they were clicking through. We modeled the data using four dimensions: IP address 
of the originating site (48,128 values), referring site (10,432 values), subject URI 
(18,085 values), and hours of day (24 values). The resulting cube contained over 200 
trillion cells! (Clearly, the cube was extremely sparse.) Each of the dimensions 
participated in a 2-level or 3-level hierarchy. We estimated that to roll up such a cube 
along these dimension hierarchies using the regular rollup operation supported by 
the OLAP server would require some 10,000 hours (i.e. more than one year), on a 
single Unix server! The problem is that while most MOLAP or ROLAP engines 
provide efficient mechanisms for caching and storing sparse cubes, they lack 
efficient mechanisms for rolling up such cubes. 
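
The cell count quoted above follows directly from the four dimension sizes. The short Python calculation below reproduces it; the density bound assumes, for illustration only, that the cube holds one week of data (1.5 million hits) and that every hit lands in a distinct cell.

```python
from math import prod

# Dimension sizes as reported for the newspaper Web site.
dimension_sizes = {
    "ip": 48_128,
    "referring_site": 10_432,
    "subject_uri": 18_085,
    "hour": 24,
}

cells = prod(dimension_sizes.values())

# Assumption: one week of data, each hit in its own cell (the best case).
weekly_hits = 1_500_000
max_fill = weekly_hits / cells

print(f"{cells:,} cells, at most {max_fill:.2e} of them non-empty")
```

Even in this best case fewer than one cell in a hundred million would be populated.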

We have introduced several “non-traditional” OLAP solutions to handle the above 
scalability issues for our Web access analysis engine. We develop approaches for 
dealing with very large, sparse cubes. We introduce the notion of high-diagonal 
cubes to replace the traditional “embedded-total” cubes (in which all intermediate 
summaries all the way to the top of each dimension hierarchy are computed at load 
time), and we use direct binning, instead of rollup, to populate the high-diagonal 
cubes. We further reduce the computation load by selecting high-profile dimension 
elements. These approaches allow us to speed up Web log analysis by three orders of 
magnitude. Additional bookkeeping is required to maintain the relationships 
between the high-level data cubes containing aggregates and the low-level data cubes 
containing detailed data, thus allowing users to drill down selectively. 

This data-warehouse/OLAP framework and Web access analysis engine have been 
implemented at HP Labs. Our experience has demonstrated that it is possible to 
overcome the performance problems of handling sparse data cubes, and to automate 
the whole operation chain, including data filtering, loading, incremental 
summarization and analysis. The application, including the optimizations described in this 
paper, was implemented by OLAP programming in the scripting language provided 
by the OLAP server. Further, as we described in [5,6], we use OLAP programming 
to compute various classes of multi-level and multi-dimensional patterns and 
association rules. In particular, we had introduced several families of association rules 
such as scoped association rules and functional association rules. We show how 
these classes of patterns and association rules can be used in analyzing WLRs, and 
we define a new class of time-variant rules, which are also useful for Web access 
analysis. Thus, we use the OLAP server as a computation engine to support data 
mining operations. In this respect, we are in agreement with the approach described 
in [9] that uses OLAP tools to support large-scale data mining. 

Section 2 introduces the architecture of the OLAP-based Web access analysis 
engine and shows how it automates the operation chain. Section 3 discusses the 
scalability issues. Section 4 illustrates pattern and rule discovery for Web access analysis. 
Finally, Section 5 gives some conclusions and summarizes lessons learnt. 



2 Data Warehouse/OLAP Based Web Access Analysis Engine 

Almost all e-commerce applications are Web based. Web log records (WLRs) are 
generated to represent information specific to each Web access attempt. Each WLR 
typically contains, among other things, the IP address of origin site, the access time, 
the referring site, the URI of the target site (i.e., the Web page or object accessed), 
the browser method and protocol used. 

There are two general tasks for Web access analysis: (1) compute 
multi-dimensional summary information from a number of raw WLRs; and (2) derive 
usage patterns and rules for supporting business intelligence. Below are some 
examples. 

Usage Analysis. The volume and distribution of hits for specific topics, dimensioned 
by origin site, referring site and time at multiple levels, can be used as quantitative 
measures for personalizing the delivery of content to customers in different areas and 
at different times. 

Web Site Traffic Analysis. The volume and distribution of hits for target sites, 
dimensioned by referring site and time, can be used for resource and network 
planning, distributing workload over multiple sites, creating mirror sites, or caching 
content. 

Business Rules Discovery. The change of access rates to a Web site can provide 
indications of changing customer interests and behavior. For instance, the 
correlation between a content topic and certain origin sites in an area describes the interest 
of the customers in that area. While such relationships are helpful for making 
marketing promotion decisions, the changes in such relationships may be even more 
significant, since such changes usually reflect real-time trends of changes in 
customers’ interest, reactions to a marketing campaign, as well as the impact of 
competitors. To catch such relationships requires us to mine for association rules 
continuously and incrementally. 

We measure Web access in terms of volumes and probability distributions, which 
are expressed in the form of data cubes. A cube C has a set of underlying dimensions 
$D_1, \ldots, D_n$, and is used to represent a multidimensional measure. Each cell of the cube 
is identified by one element value from each of the dimensions, and contains a value 
of the measure. We say that the measure is dimensioned by $D_1, \ldots, D_n$. The set of 
elements of a dimension D, called the domain of D, may be limited (by the OLAP 
limit operation) to a subset. A sub-cube (slice or dice) can be derived from a cube C 
by dimensioning C by a subset of its dimensions, or by limiting the domains of its 
dimensions. 

For example, a cube measuring Web hit volumes is dimensioned by the IP 
addresses of origin sites, the target URI, the referring sites, and hours in a day, as 

define EXPvolume variable int <hour ip ref uri> 

In designing the dimensions of the cube, we have to decide the finest level of 
granularity at which we want to do the analysis. For instance, we have chosen hours 
as the finest time granularity, even though the raw WLRs contain time data at an 
even finer granularity (seconds). The mapping between the fields of the WLR and 
the corresponding dimension values is referred to as binning. 
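
As an illustration of binning, the Python sketch below maps each log record to a cell of a cube dimensioned by hour, ip, ref and uri, and counts the hits per cell. The record layout (a dictionary with an ISO-formatted time field) and the function name are assumptions made for this example; the actual engine is written in the OLAP server's scripting language.

```python
from collections import Counter
from datetime import datetime


def bin_wlr(record):
    """Map one web log record to its (hour, ip, ref, uri) cell."""
    # Seconds and minutes are discarded: hour is the finest time granularity.
    hour = datetime.fromisoformat(record["time"]).hour
    return (hour, record["ip"], record["ref"], record["uri"])


# Hypothetical records; field names and timestamps are made up.
page = {
    "ip": "63.211.140.164",
    "ref": "www.yahoo.com/entertaintment/book/book-store",
    "uri": "www.exp.com/TODAY/topstory.html",
}
wlrs = [
    {"time": "2000-09-04T10:15:42", **page},
    {"time": "2000-09-04T10:48:03", **page},
    {"time": "2000-09-04T11:02:10", **page},
]

# Sparse representation of the volume cube: only non-empty cells are stored.
exp_volume = Counter(bin_wlr(r) for r in wlrs)

for cell, hits in sorted(exp_volume.items()):
    print(cell[0], hits)
```

The two records logged between 10:00 and 11:00 fall into the same cell, so that cell holds a volume of 2.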

Various cubes can be derived from the above basic cube as formulas. The ability to 
use formulas to define measures over a multi-dimensional space is a powerful feature 
of OLAP tools. Further, cubes can be computed from other cubes using OLAP 
programming in the scripting language provided by the OLAP engine. 

Our infrastructure is illustrated in Figure 1. WLRs may be kept in log files, or 
stored in the warehouse together with other reference data. In the latter case, WLRs 
are fed to the data warehouse periodically or continuously, and retired to archive 
after use, under data staging control. 



[Figure 1: architecture diagram; recoverable labels are OLAP client and store back] 

Fig. 1. OLAP-based Web access analysis automation 

The infrastructure supports the computation of summary cubes and multidimensional 
and multi-level patterns and rules, based on both volume and probability 
distributions. The resulting summary cubes are stored back in the data warehouse, and 
reloaded into the OLAP server for incremental update. The entire operation chain 
from loading WLRs to computing summaries, patterns and rules, and posting the 
results on the Web is automated using OLAP programming. 



3 Scaling Up Web Access Analysis 

Scalability is the major challenge to any Web access analysis system. As we 
described in Section 1, one significant performance problem is the prohibitive cost of 
the traditional rollup operation for very large, sparse cubes. Application-level 
solutions to this problem are rarely provided. We have applied several “non-traditional” 
OLAP operations to deal with this problem, which allow us to speed up Web log 
analysis by 3 orders of magnitude. Before explaining these solutions, let us first 
examine the typical cube rollup operation. 



3.1 Typical Cube Rollup Operation 

Elements of a dimension may form a hierarchy. A hierarchical dimension D contains 
elements at different levels of abstraction. Associated with D there are a dimension 
DL describing the levels of D, a relation DL_D mapping each value of D to the 
appropriate level, and a relation D_D mapping each value of D to its parent value (the 
value at the immediate upper level). When cube C is rolled up along dimension D, the 
measure value at a higher level is the total of the measure values at the corresponding 
lower levels. A cube may be rolled up along multiple dimensions. 
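
A rollup along one hierarchical dimension can be sketched in a few lines of Python. This is an illustration only: the cube is reduced to a single dimension, the parent relation D_D is modelled as a dictionary, and apart from the first IP-to-origin pair, which is borrowed from the paper's own mapping examples, all values are made up.

```python
def rollup(base_values, parent):
    """Add to every ancestor the total of the base-level values beneath it.

    base_values: measure values of the lowest-level elements
    parent:      the relation D_D, mapping an element to its parent value
    """
    totals = dict(base_values)
    for element, value in base_values.items():
        ancestor = parent.get(element)
        while ancestor is not None:
            totals[ancestor] = totals.get(ancestor, 0) + value
            ancestor = parent.get(ancestor)
    return totals


# Hypothetical parent relation from ip level to origin level.
origin_ip = {
    "63.211.140.164": "CA",
    "63.211.140.165": "CA",
    "10.0.0.7": "NY",
}
hits_by_ip = {"63.211.140.164": 3, "63.211.140.165": 4, "10.0.0.7": 5}

rolled = rollup(hits_by_ip, origin_ip)
print(rolled["CA"], rolled["NY"])
```

The rolled-up result keeps the base cells and adds one embedded total per higher-level element.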

In the application described in this paper, we consider origin, subject, refsite as 
high-level dimensions of ip, uri, ref respectively. In Oracle Express, the mappings 
between them can be defined by relations origin_ip, subject_uri and refsite_ref. 
Below are some mapping examples. 

□ ip: 63.211.140.164 → origin: CA 

□ uri: www.exp.com/TODAY/topstory.html → subject: www.exp.com/TODAY/ 

□ ref: www.yahoo.com/entertaintment/book/book-store → refsite: www.yahoo.com/ 

In the traditional OLAP approach, one defines a cube with multiple hierarchical 
dimensions, where each dimension has elements at more than one level. For 
example, one can define a dimension from-site with elements at ip level and origin level, 
drawn from dimensions ip and origin respectively; a to-site dimension with elements 
at uri level and subject level; a via-site dimension with elements at ref level and 
refsite level. The mappings between elements at different levels are based on the 
relations defined above. Then, a cube recording the volume of hits may be defined as 

volume <from-site, to-site, via-site, hour> 

When this cube is rolled up over all dimensions, it contains all the sub-totals of the 
original cells, for multiple dimensions and at multiple levels; this is referred to as 
embedded-total. When the original cube has multiple large-sized dimensions, a large 
number of additional cells are needed to hold the embedded-total. In the above 
example, these sub-totals occupy approximately 50 trillion cells in the rolled up cube, 
many of them nulls, out of a total of 267 trillion cells. While the OLAP engine is 
designed to compress sparse cubes for storage, the cells containing nulls must be 
checked in some way during the rollup operation. Consequently, handling and 
rolling up such a cube as a whole is impractical. 



3.2 Scalability Enhancements with Diagonal Aggregation 



Our solution to the above problem is simply not to manipulate such a cube with 
large-sized dimensions as a whole. We define another, relatively smaller, cube to 
hold aggregated values with the following basic requirements: it provides high-level 
abstraction; and it maintains the relationships between dimension elements at 
different hierarchical levels to allow drill-down. 

As shown in Figure 2, we represent Web access volumes at basic and aggregate 
levels by the following separate cubes. 

□ Basic Volume Cube (BVC), which takes into account all individual WLRs. This 
is the cube we defined earlier: 

EXPvolume variable int <hour sparse <ip ref uri>> 

□ High-Diagonal Cube (HDC), which represents the summary information with 
respect to the parent dimensions of ip, uri, and ref, i.e. origin, subject, refsite 
respectively, as well as the hour dimension. N:1 mappings from ip to origin, from 
uri to subject, and from ref to refsite, are provided. The HDC in our example is 
defined as 

EXPvolume.high variable int <hour sparse <origin refsite subject>> 

The HDC, EXPvolume.high, is a summarization of the corresponding BVC, 
EXPvolume, aggregated over all dimensions. In this way EXPvolume.high contains 
far fewer cells than EXPvolume, and hence is easy to manipulate with reasonable 
performance. 




[Figure 2: diagram; recoverable labels are HDC: EXPvolume.high and BVC: EXPvolume] 

Fig. 2. Conventional cube rollup (left) vs. diagonal aggregation without rollup (right) 



Note that EXPvolume.high does not contain the partial aggregates of EXPvolume,
namely, the aggregates along one or more, but not all, dimensions. These aggregates
can be selectively generated on demand as query results. For example, to drill down
an EXPvolume.high cell with www.yahoo.com as refsite, the relation refsite_ref can be
used to relate it to a set of lower-level elements of dimension ref, such as
www.yahoo.com/entertaintment/book/book-store, that underlie a sub-cube
of EXPvolume (Figure 3). Since such query operations deal only with sub-cubes, they
are relatively inexpensive.
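A minimal sketch of such a drill-down, assuming the BVC is held as a sparse dictionary keyed by (hour, ip, ref, uri) and the relation refsite_ref is available as a mapping from each referring site to its set of ref elements. Both data structures and the sample values are assumptions made for illustration.

```python
# Hypothetical relation refsite_ref: referring site -> lower-level ref elements.
REFSITE_REF = {
    "www.yahoo.com": {"www.yahoo.com/a", "www.yahoo.com/b"},
    "www.example.org": {"www.example.org/links"},
}


def drill_down_by_refsite(bvc, refsite):
    """Return the sub-cube of the BVC that underlies one refsite element.

    bvc maps (hour, ip, ref, uri) to a hit count; only cells whose ref
    element belongs to the given refsite are kept.
    """
    refs = REFSITE_REF[refsite]
    return {cell: hits for cell, hits in bvc.items() if cell[2] in refs}


bvc = {
    (12, "192.0.2.1", "www.yahoo.com/a", "/EXP/TODAY/news.html"): 3,
    (12, "192.0.2.2", "www.example.org/links", "/EXP/TODAY/news.html"): 4,
    (13, "192.0.2.1", "www.yahoo.com/b", "/EXP/TODAY/sports.html"): 1,
}
yahoo_subcube = drill_down_by_refsite(bvc, "www.yahoo.com")
```

Only the selected sub-cube is touched when partial aggregates are then computed, which is why such queries stay inexpensive relative to a full rollup.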



[Figure 3 diagram; cube labels: BVC: EXPvolume, HDC: EXPvolume.high]

Fig. 3. Drill down from HDC to BVC based on query



3.3 Scalability Enhancements with Direct Binning rather than Rolling Up 

Since EXPvolume is a sparse cube with large-sized dimensions, generating the derived
cube EXPvolume.high from it is rather expensive. In contrast, the WLRs, whether stored
in files or database tables, are not sparse. Very often, for each batch load, the number
of WLRs is much smaller than the number of cells of the EXPvolume cube. For our
application, we had millions of WLRs, but the EXPvolume cube had billions of cells.
In this case, populating and updating EXPvolume.high directly from the log files can
reduce both memory load and computation load, compared with deriving it from
EXPvolume. We call this mechanism direct binning.

Consider a simple case (Figure 4), where we have a volume cube with k dimensions
$D_1, \ldots, D_k$, and each dimension is extended to include a single high-level element
'top'. To populate a summary cube containing the total as well as all the sub-totals
wrt each dimension element, each WLR contributes to

$$\sum_{i=0}^{k} C_k^i = 2^k$$

cells, where only one cell is for the base data and all others are for the above total
and subtotals. In our example, the EXPvolume.high cube has 4 dimensions, therefore each
WLR is used to update $C_4^0 + C_4^1 + C_4^2 + C_4^3 + C_4^4 = 16$ cells during direct
binning. For sparse cubes with large dimensions, where the ratio between the numbers of
high-level cells and input records falls within a certain range, directly populating
high-level cells outperforms rollup.
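The cell count above can be made concrete with a small Python sketch. It assumes a sparse cube held in a Counter and a single 'top' element per dimension; the function name and sample coordinates are hypothetical. Each record updates every combination in which a coordinate is either kept or replaced by 'top', which gives $2^k$ cells.

```python
from collections import Counter
from itertools import product

TOP = "top"  # the single high-level element added to every dimension


def bin_record(cube, coords, hits=1):
    """Add one record to the base cell and to every total and sub-total cell.

    Each coordinate is either kept or replaced by TOP, so a record with
    k coordinates updates 2**k cells.
    """
    for cell in product(*[(c, TOP) for c in coords]):
        cube[cell] += hits


cube = Counter()
bin_record(cube, (12, "France", "www.yahoo.com", "EXP/TODAY"))
cells_after_first = len(cube)
bin_record(cube, (12, "USA", "www.yahoo.com", "EXP/TODAY"))
```

After the second record, the eight cells that use 'top' on the origin dimension are shared by both records, while the eight cells that name the origin are new.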

[Figure 4 diagram; cube label: HDC: EXPvolume.high]

Fig. 4. Populating high-level cells of HDC during loading

3.4 Scalability Enhancement with High-profile Cubes

Limiting dimension elements to those that underlie cells with large counts is another
way to achieve data reduction. A cube defined on the limited dimensions contains
fewer cells, and thus is easy to manipulate. Using this approach, some cells containing
small or zero counts will be dropped, but these are insignificant for most
applications.

In our application, we introduced a high-profile cube (HPC) that is a sub-cube of
HDC by taking into account only the high-profile elements of dimensions origin,
subject, refsite, i.e., those elements that correspond to Web access hit rates above a
given threshold. The HPC in our example is defined as

EXPvolume.top variable int <hour sparse <toporigin toprefsite topsubject>>

The high-profile elements of a dimension are identified in the following way.
Given a volume cube $C[D_1, \ldots, D_n]$ that measures hit counts, a dimension
$D_i \in \{D_1, \ldots, D_n\}$, and a filter ratio $0 < k < 1$ wrt the average count over the
elements of $D_i$, the threshold t is defined relative to the average count per element of
dimension $D_i$, as $\text{total}(C) / (\text{size}(D_i) * k)$, where total(C) is the total count of hits
and $\text{size}(D_i)$ is the number of elements in $D_i$. Those elements of $D_i$ with subtotal
counts over the threshold are considered 'high-profile' ones. For example, for cube
EXPvolume.high, the total counts can be calculated by the following.

EXPtotal = total(EXPvolume.high)

Dimensioned totals can be calculated by the following.

EXPbyorigin = total(EXPvolume.high, origin) //dimensioned by origin
EXPbysubject = total(EXPvolume.high, subject) //dimensioned by subject
EXPbyrefsite = total(EXPvolume.high, refsite) //dimensioned by refsite

Then, for example, the threshold for dimension origin is determined by

threshold.origin = EXPtotal/size(origin) * k

The high-profile elements of origin are extracted by

limit origin to EXPbyorigin > threshold.origin

and then loaded to a separate dimension toporigin.

The elements of dimensions toporigin, topsubject and toprefsite are subsets of
those of origin, subject and refsite respectively. Therefore, total hits and their
probability distributions must be calculated over EXPvolume.high for accuracy.
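The selection of high-profile elements can be sketched in Python as follows. The sketch assumes that the threshold is k times the average count per element, which is how the expression for threshold.origin reads when evaluated left to right; the function name and the sample sub-totals are hypothetical.

```python
def high_profile(subtotals, k):
    """Return the elements whose sub-total exceeds k times the average count.

    subtotals maps each element of one dimension (e.g. origin) to its
    dimensioned total; k is the filter ratio with 0 < k < 1.
    """
    total = sum(subtotals.values())
    # Assumed reading of the threshold: (total / size) * k.
    threshold = total / len(subtotals) * k
    return {element for element, count in subtotals.items() if count > threshold}


by_origin = {"USA": 70, "France": 25, "UK": 5}
top_origins = high_profile(by_origin, k=0.5)
```

With these sample counts the average is about 33.3 hits per origin, the threshold is about 16.7, and only USA and France qualify.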



3.5 Overall Performance Comparison 

In summary, we represent Web access volumes by separate volume cubes: a BVC
(EXPvolume), an HDC (EXPvolume.high), and an HPC (EXPvolume.top). The table
in Figure 5 shows how data can be reduced with this approach. It is easy to see from
this table that without such data reduction, carrying out Web access analysis with
OLAP is impractical. Figure 6 shows that the proposed approach outperforms the
conventional approach dramatically. The comparison illustrates the practical value of
our approach for handling the given class of applications. With our approach,
low-level details (measured by BVC) and high-level summaries (measured by HDC and
HPC), as well as links between them (given by the relations between corresponding
dimension elements), are made available. Information not directly covered
by these cubes may be computed by queries involving relatively inexpensive sub-cube
manipulations.



Dimension Sizes
  ip                          origin            90        toporigin        35
  uri         18,085          subject           229       topsubject       32
  ref         10,432          refsite           2,167     toprefsite       25
  hour        24              hour              24        hour             24

Cube Sizes
  EXPvolume   217,919 billion
  EXPvolume.high  1.6 billion
  EXPvolume.top   0.000672 billion

Fig. 5. Scalability improvement by data reduction



Conventional approach
  Loading cube EXPvolume                 1 hour
  Rollup EXPvolume (by estimation)       10,000 hours
  Total estimated time                   10,000 hours

Proposed approach
  Loading EXPvolume                      1 hour
  Direct binning EXPvolume.high          1.2 hours
  Generating EXPvolume.top               0.3 hour
  Total time                             2.5 hours

Fig. 6. A comparison of performance
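Two of the reported figures can be cross-checked with simple arithmetic. The sketch below is only a reader's check: it multiplies the top-level dimension sizes from Fig. 5 and compares the total times from Fig. 6. The variable names are illustrative.

```python
from math import log10, prod

# Top-level dimension sizes from Fig. 5: hour, toporigin, topsubject, toprefsite.
hpc_cells = prod([24, 35, 32, 25])

# Times in hours from Fig. 6.
conventional_hours = 1 + 10_000        # loading plus estimated rollup
proposed_hours = 1 + 1.2 + 0.3         # loading, direct binning, generating the HPC
speedup = conventional_hours / proposed_hours
orders_of_magnitude = log10(speedup)
```

The product of the top-level dimension sizes is 672,000 cells, i.e. 0.000672 billion, and the ratio of the two totals is roughly 4,000, i.e. between three and four orders of magnitude.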



4 Web Usage Analysis 

We now describe the typical Web usage analysis functions supported by our system. 



4.1 Multilevel and Multidimensional Analysis 

As discussed above, cubes representing multidimensional Web access volumes are
generated at three levels: the basic level, the summary level, and the top level.
Various Web access patterns can be derived from these cubes. They may be used to
represent the access behavior of a collection of users or a single user; they may be
based on volumes or probability distributions; and they may be materialized (defined as
variables) or not (defined as formulas). Below are some examples.

Multidimensional Patterns. A cube representing the access volumes by hour for the
most popular subjects and top referring sites from users in France can be defined by
the following formula (view), derived from cube EXPvolume.top:

define VolFromFrance.top formula int <hour, toprefsite, topsubject>

EQ EXPvolume.top(toporigin 'France')

Multilevel Patterns. Using the relations origin_ip, refsite_ref and subject_uri, one
can "drill down" from a specific cell in cube EXPvolume.high,

EXPvolume.high(hour '12', origin 'France', refsite 'www.yahoo.com',
subject 'www.exp.com/EXP/TODAY')

to identify a sub-cube of EXPvolume, through the following operations.

limit ip to origin_ip 'France'
limit ref to refsite_ref 'www.yahoo.com'
limit uri to subject_uri 'EXP/TODAY'
limit hour to '12'
report EXPvolume

Probability Distribution based Patterns. Cubes representing probability distribution
based patterns are derived from volume-based pattern cubes. They provide a fine-grained
representation of dynamic behavior. Given cube EXPvolume.high, for example,
the volume cube dimensioned by hour and subject is defined by

define VolByHourBySubject formula int <hour, subject>

EQ total(EXPvolume.high, hour, subject)

The cube representing probability distributions of the above information over all hits
is expressed as

define VolByHourBySubject.dist1 formula decimal <hour, subject>

EQ total(EXPvolume.high, hour, subject) / total(EXPvolume.high)

Further, the conditional probability distributions over the hits per subject are
expressed as

define VolByHourBySubject.dist2 formula decimal <hour, subject>

EQ total(EXPvolume.high, hour, subject) / total(EXPvolume.high, subject)

In the actual implementation, some of the above cubes are materialized for
computational efficiency. However, for consistency, it is only necessary to store volume
cubes persistently in the data warehouse. Derived patterns, either materialized or
not, can be generated at analysis time.
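For readers unfamiliar with the formula notation, the three derived cubes can be mimicked in Python over a sparse dictionary. This is a sketch under the assumption that the HDC is keyed by (hour, origin, refsite, subject); the function name and sample data are hypothetical.

```python
from collections import Counter


def hour_subject_distributions(hdc):
    """Derive volume and probability cubes dimensioned by (hour, subject).

    Returns (volume, dist1, dist2), where dist1 divides by the grand total
    and dist2 divides by the total of the same subject.
    """
    volume, by_subject, total = Counter(), Counter(), 0
    for (hour, _origin, _refsite, subject), hits in hdc.items():
        volume[(hour, subject)] += hits
        by_subject[subject] += hits
        total += hits
    dist1 = {cell: v / total for cell, v in volume.items()}
    dist2 = {(h, s): v / by_subject[s] for (h, s), v in volume.items()}
    return volume, dist1, dist2


hdc = {
    (9, "France", "www.yahoo.com", "news"): 6,
    (10, "France", "www.yahoo.com", "news"): 2,
    (9, "USA", "www.yahoo.com", "sports"): 2,
}
volume, dist1, dist2 = hour_subject_distributions(hdc)
```

Because both distributions are recomputed from the volume cube, only the volumes need to be stored persistently, in line with the consistency remark above.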



4.2 Multi-level and Multidimensional Feature Ranking 

Feature ranking, such as the top 10 Web sites being accessed, is important for such
applications as targeted advertising. Using OLAP enables ranking of Web access
along multiple dimensions and at multiple levels. For example, given a Web site, e.g.
www.connect.com, one may be interested in ranking the hits to it by companies,
areas and time.

Given a volume cube, ranking on a feature (dimension) is represented by a pair of
cubes, one for a ranked list of elements of that dimension, and the other for the
corresponding volume or probability measures from which the ranking was computed.
Consider the cube EXPvolume.top, dimensioned by hour, toporigin, toprefsite, and
topsubject. The ranking of the top N subjects dimensioned by origin sites, referring
sites, and hour is represented by the following pair of cubes

define subject_tp.list variable text <order hour toprefsite toporigin>
define subject_tp.perc variable dec <order hour toprefsite toporigin>

where "order" is a dimension containing the ranks 1, 2, ..., N.

In general, the multidimensional ranking information for a feature X is kept in a
pair of ranking cubes dimensioned by $O, A_1, A_2, \ldots, A_n$, say $R_X[O, A_1, A_2, \ldots, A_n]$
and $R'_X[O, A_1, A_2, \ldots, A_n]$. The ranked elements of X are kept as cell values of $R_X$,
and the corresponding measure values (volume or percentage) are kept as cell values of
$R'_X$. O is the dimension for the ordered numbers 1, 2, ..., N. Typically these two cubes
are computed from a measure cube with X and other dimensions related to
$A_1, A_2, \ldots, A_n$, denoted by $C[X, A'_1, A'_2, \ldots, A'_n]$. The general algorithm is shown
below.

In nested loops, focus on each subcube of $R_X[O, A_1, A_2, \ldots, A_n]$ on dimensions
$A_1, A_2, \ldots, A_n$, say $R_X[A_1 = a_1, A_2 = a_2, \ldots, A_n = a_n]$, that is dimensioned by O,
denoted $R_{Xs}[O]$. Map $R_{Xs}[O]$ to a subcube of C, say $C_s$, that is fixed on all
dimensions except X. Generate a sorted list of X elements based on the measure values of
$C_s$, and assign them to $R_{Xs}[O]$. The corresponding measures (in volume or percentage)
are assigned to the counterpart sub-cube of $R'_X$.
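The ranking procedure can be sketched in Python under simplifying assumptions: the measure cube is a sparse dictionary, the ranking cubes use the same dimensions as the measure cube apart from X, and ties are broken by element name. The function and its parameters are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def rank_feature(cube, x_pos, top_n):
    """Build a pair of ranking cubes for the feature at position x_pos.

    cube maps coordinate tuples to a numeric measure. The result is a pair
    (ranked, measures), both keyed by (order, *other_coordinates), where
    order runs from 1 to at most top_n.
    """
    groups = defaultdict(list)
    for coords, value in cube.items():
        others = coords[:x_pos] + coords[x_pos + 1:]
        groups[others].append((value, coords[x_pos]))

    ranked, measures = {}, {}
    for others, items in groups.items():
        # Sort by descending measure, then by element name for stable ties.
        items.sort(key=lambda item: (-item[0], item[1]))
        for order, (value, element) in enumerate(items[:top_n], start=1):
            ranked[(order,) + others] = element
            measures[(order,) + others] = value
    return ranked, measures


# Coordinates are (hour, subject, origin); subject is the ranked feature.
cube = {
    (9, "news", "France"): 5,
    (9, "sports", "France"): 8,
    (9, "weather", "France"): 1,
    (9, "news", "USA"): 4,
}
ranked, measures = rank_feature(cube, x_pos=1, top_n=2)
```

Grouping by the remaining coordinates plays the role of the nested loops, and each group corresponds to one subcube that varies only along X.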



4.3 Correlation Analysis 

An important aspect of Web access analysis is to understand the correlation between
different factors, such as between origin sites and subjects. Such correlations can be
represented as association rules. We have described in [10] an approach to use cube
operations to mine association rules, and we proposed several extensions to conventional
association rules, including scoped, multilevel, and multidimensional rules. We
show how to apply these rules to Web access analysis, and we describe further extensions
to rules with flexible bases and time-variant rules.

4.3.1 Multilevel Association Rules with Flexible Base and Dimensions

Association rules provide a quantitative measurement of the correlation between
facts [6]. For example, if 50% of the origin sites for accesses to pages belonging to
some specific subject are via referring site www.yahoo.com, and only 10% of all
these origin sites use Yahoo as a referring site, we say that the association rule has
confidence 50% and support 10%. Given minimum support and confidence thresholds,
a rule is considered strong if it satisfies these thresholds.

An association rule has an underlying base B that defines the population over
which the rule is defined. For example, the correlation between subjects (i.e. target
sites) and referring sites can be based on accesses, as

x ∈ WLRs: contain_subject(x, S) => contain_refsite(x, R),

or based on origin site, as

x ∈ origins: access_subject(x, S) => via_refsite(x, R),

regardless of whether the navigation occurs in the same session or not. In this example,
the association rule uses binary predicates with the first place denoting a base
element and the second place denoting an item.

In [6], we showed how to represent multidimensional and multilevel association
rules using cubes. For example, the cube

Cv [time, origin, refsite, subject]

contains sufficient information for deriving association rules between referring sites
and subjects (target sites). For example, we can define rules dimensioned by, and at
different levels of, time and origin_area, such as

[ x ∈ origins: access_subject(x, S) => via_refsite(x, R)] |
time = 'Jan99', origin_area = 'CA'

[ x ∈ origins: access_subject(x, S) => via_refsite(x, R)] |
time = 'Year99', origin_area = 'USA'

The above data cube also contains sufficient information for deriving rules that
express the correlation between subjects, e.g.

[ x ∈ origins: access_subject(x, A) => access_subject(x, B)] |
time = '01Oct99', origin_area = 'UK'

In [6], we gave an algorithm that, starting from a given volume cube such as Cv
(time, origin, refsite, subject), first computes a base cube Cb (refsite, origin_area), a
population cube Cp (subject, refsite, origin_area), and an association cube Ca (subject,
subject2, refsite, time, origin_area); and then uses these cubes to derive support
and confidence cubes. Note that the association cube includes a new dimension
subject2, which has the same elements as subject, and its measure is the count of base
elements corresponding to each combination of subject and subject2. We omit the
details from this paper.
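Since the details are omitted, the following Python sketch only illustrates one plausible reading of these cubes for rules of the form access_subject(x, A) => access_subject(x, B), with origins as the base. It assumes the accesses have already been limited to one time element and one origin_area, that confidence is the association count divided by the population count, and that support is the association count divided by the base count. It is not the algorithm of [6], and all names are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def rule_cubes(accesses):
    """Compute confidence and support for subject A => subject B per refsite.

    accesses is an iterable of (origin, refsite, subject) tuples.
    Returns (confidence, support), both keyed by (A, B, refsite).
    """
    base = defaultdict(set)        # refsite -> origins seen via that refsite
    population = defaultdict(set)  # (subject, refsite) -> origins accessing it
    for origin, refsite, subject in accesses:
        base[refsite].add(origin)
        population[(subject, refsite)].add(origin)

    confidence, support = {}, {}
    for (a, ref_a), (b, ref_b) in permutations(population, 2):
        if ref_a != ref_b:
            continue
        # Association count: origins that accessed both subjects.
        both = population[(a, ref_a)] & population[(b, ref_b)]
        confidence[(a, b, ref_a)] = len(both) / len(population[(a, ref_a)])
        support[(a, b, ref_a)] = len(both) / len(base[ref_a])
    return confidence, support


accesses = [
    ("o1", "www.yahoo.com", "A"),
    ("o1", "www.yahoo.com", "B"),
    ("o2", "www.yahoo.com", "A"),
    ("o3", "www.yahoo.com", "C"),
]
confidence, support = rule_cubes(accesses)
```

Using sets of origins rather than hit counts keeps each base element from being counted more than once, whatever the number of its accesses.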

4.3.2 Time-variant Association Rules 

In the above association rules, only the elements of the time dimension are considered.
In reality, rules with respect to time-variant predicates may be more interesting,
such as a rule that relates accesses (based on origin sites) to subjects A and B within
the same day,

[ x ∈ origins: access_subject(x, A) => access_subject(x, B)] | time = 'sameday', ...

This rule concerns a predicate over the time dimension, which we model as a special
dimension called time-slot.

The volume cube for computing association rules dimensioned by generic time-slots
(e.g. same-day, same-week, ...) is the same as defined above. The association,
support, and confidence cubes are dimensioned by time-slot. Note that there is no
need to dimension the population cube and base cube by time-slot since they are the
same wrt all time-slot elements. The definitions of these cubes are shown below.

□ association cube: Ca (subject, subject2, refsite, time-slot, origin_area)

□ population cube: Cp (subject, refsite, origin_area)

□ base cube: Cb (refsite, origin_area)

□ confidence cube: Cf (subject, subject2, refsite, time-slot, origin_area)

□ support cube: Cs (subject, subject2, refsite, time-slot, origin_area)

The computation of a rule dimensioned by generic time-slots differs from the
computation of a rule dimensioned by time instants in the following aspects.

□ Time bins: for rules dimensioned by generic time-slots, the time bins are not
particular time elements such as hours or days, but rather time predicates. For
instance, from an origin site, the accesses to subjects A and B in any week are
mapped to the time-slot element 'same-week'.

□ Duplicate elimination: the base elements of rules, e.g. origin sites, are not
repeatedly counted for a generic time-slot. For example, an origin site from which
subjects A and B are accessed multiple times within the same day, or within the
same day on multiple days, only contributes one count to same-day access.

□ The handling of population and base cubes is also different, since they have
no time-related dimensions as described above.

Accordingly, the algorithm for mining association rules dimensioned by generic
time-slots includes the following additional or different steps from the algorithm
described in [6].

□ For each generic time slot (e.g., same-day), limit time instance accordingly (limit
time to all days, excluding weeks, etc).

□ The population cube Cp is instantiated with the dimensioned total counts of origin
sites in each origin_area, wrt subject, refsite, origin and based on the antecedent
condition Cv (subject A) > 0. The base cube Cb is instantiated with the
dimensioned total counts of origin sites wrt refsite and origin_area.

□ In calculating the association cube Ca wrt each pair of subject, subject2, instead
of counting the total origin sites that satisfy the association condition Cv (subject
A) > 0 and Cv (subject2 B) > 0, for each origin site (in a loop), check whether it
satisfies that condition in any time instance belonging to that time slot (e.g. any
day), and if it does, count the origin site once only.

□ Under the new definitions of these cubes, confidence cube and support cube are
still computed by the cell-wise operations Cf = Ca / Cp and Cs = Ca / Cb.
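The duplicate-elimination step for a generic time-slot can be sketched in Python as follows. The sketch assumes access records of the form (origin, time instant, subject) and a function that maps a time instant to its bin (for example its day); both are hypothetical simplifications, with refsite and origin_area left out.

```python
from collections import defaultdict


def same_slot_association(accesses, slot_of):
    """Count, per subject pair (A, B), the origins that accessed A and B
    within the same time bin, counting each origin at most once.

    accesses is an iterable of (origin, time_instant, subject);
    slot_of maps a time instant to its bin, e.g. to its day.
    """
    subjects_in_bin = defaultdict(set)  # (origin, bin) -> subjects accessed
    for origin, instant, subject in accesses:
        subjects_in_bin[(origin, slot_of(instant))].add(subject)

    origins_for_pair = defaultdict(set)  # (A, B) -> origins satisfying the rule
    for (origin, _bin), subjects in subjects_in_bin.items():
        for a in subjects:
            for b in subjects:
                if a != b:
                    origins_for_pair[(a, b)].add(origin)
    return {pair: len(origins) for pair, origins in origins_for_pair.items()}


# Time instants are (day, hour) pairs; the bin of an instant is its day.
accesses = [
    ("o1", (1, 9), "A"), ("o1", (1, 15), "B"),   # same day, counted
    ("o1", (2, 10), "A"), ("o1", (2, 11), "B"),  # same origin again, not recounted
    ("o2", (1, 9), "A"), ("o2", (2, 9), "B"),    # different days, not counted
]
same_day = same_slot_association(accesses, slot_of=lambda instant: instant[0])
```

Collecting origins in a set per subject pair is what prevents an origin that satisfies the condition on several days from contributing more than one count.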



5 Conclusions 

Collecting and mining Web log records (WLRs) from e-commerce Web sites has
become increasingly important for targeted marketing, promotions, and traffic analysis.
We recognize that Web log mining solutions must scale to meet the requirements
of huge data volumes and data flow rates encountered in these applications. We have
identified several important scalability issues, and have developed an OLAP and data
warehousing-based architecture that addresses these issues. We have applied several
non-traditional approaches, allowing us to speed up WLR analysis by 3 orders of
magnitude.

In addition to the typical OLAP operations, our framework also supports data
mining operations such as multilevel and multidimensional association rules. We
extended the class of association rules to include rules with flexible bases and
time-variant rules. Finally, it is important in many applications to collect and analyze
Web logs continuously, rather than in batches. To this end, we have automated the
whole operation chain, including data filtering, loading, incremental summarization
and analysis.

Unlike most OLAP-based applications that treat OLAP servers purely as front-end 
analysis tools, we use the OLAP server as a computation engine, and support infor- 
mation staging between the data warehouse and the multi-dimensional database 
managed by the OLAP server. Further support for scalability can be provided 
through parallelism and distributed OLAP [6]. 

The infrastructure described in this paper has been fully implemented at HP Labs.
Our results have validated the scalability and maintainability of this infrastructure.

Abstract. This paper discusses the PROMISE (Predicting User Behavior in
Multidimensional Information System Environments) approach, which deploys
information about characteristic patterns in the user's multidimensional data access
in order to improve caching algorithms of OLAP systems. The paper motivates
this approach by presenting results of an analysis of the user behavior in a
real-world OLAP environment. Further contributions of this paper are a model
for characteristic OLAP query patterns based on Markov Models and a corresponding
OLAP query prediction algorithm.



1 Introduction 

Online Analytical Processing (OLAP) systems offer capabilities to interactively
formulate queries by iteratively applying multidimensional operations (e.g. slicing,
drilling). Due to the importance of interactive response times in these environments,
caching techniques for OLAP systems have received the attention of researchers ([1],
[4], [11]). Additionally, it has been recognized ([8]) that the workload of an OLAP
application is characterized by the user's explorative and navigational data analysis
task. A session typically starts with a query (mostly a predefined report) and the user
then successively manipulates the results by applying multidimensional operations (each
of which results in a new query against the database). As a consequence, the OLAP
workload shows specific high-level patterns that stem from the structure of the
analytical task the user is solving.

The idea of the PROMISE (Predicting User Behavior in Multidimensional Information
System Environments) approach is to provide the OLAP cache manager with
information about these high-level access patterns in order to make predictions. An
algorithm that (at a point during an OLAP session) efficiently computes a set of
queries and the corresponding probabilities for these queries to be executed in the near
future can enhance caching techniques in two ways:

Admission and eviction algorithms both need functions to estimate the benefit of a
cached object for future queries ([11]). Here, an accurate prediction of future
queries and their probabilities can be used to enhance the computation of the future
benefit of a cached object.

Additionally, prefetching strategies can complement the demand fetching strategies
currently deployed for OLAP caches. Making use of predictions, the cache manager
can speculatively fetch objects into the cache (during idle times) that are likely to
be beneficial for future queries. This reduces the latency perceived by the user when
the prefetched data is actually accessed.

1 An extended version of this paper can be found in [9].

As the example scenario for the rest of this paper, let us consider a logistics planning
system used to plan the supply of spare parts for different kinds of vehicles
(automobiles, trucks etc.). A data warehouse contains data about which vehicles are
deployed in which regions and information about repairs of vehicles (including which
parts were exchanged, what type of failure occurred, and the geographical location of
the repair). Figure 1 depicts an extract of the multidimensional schema for such an
application. A typical task of a logistics planner is to determine a distribution and
stock-keeping strategy for a certain geographic region or repair location (e.g. how many
parts are kept locally to satisfy anticipated demands). While this task is far from
trivial and cannot be easily automated, the analysis process follows a methodological
structure of subtasks.

Fig. 1. The schema for the example scenario (depicted using the ME/R notation [10])



2 Related Work 

To the best of our knowledge, no work on predictive prefetching in OLAP environments
has been done so far. Nevertheless, predictive prefetching techniques have been
extensively researched in the areas of branch prediction for pipelined microprocessors,
operating systems for intelligent prefetching of files (e.g. [5]), distributed
hypertext applications (e.g. WWW) for presending documents and proxy cache
design (e.g. [2], [7]), and OO databases (e.g. [3]) for prefetching objects.

Directly transferring these approaches to the OLAP domain, where the entity of
prediction is a multidimensional query, reveals two fundamental differences between
OLAP systems and the other application areas. First, the number of prediction entities
(queries) is considerably larger than in other domains (e.g. the number of files in a
system). Second, different cached objects (i.e. queries) are not disjoint, in the sense
that the results of a new query can be computed using parts of several cached objects.
That means a prediction is already useful if it yields a query from which the
exact query can be derived. Another distinctive feature of PROMISE compared to all
the presented approaches is the usage of patterns on a level of abstraction higher than
the actual granularity of access (e.g. Web documents, pages, objects, files); see
section 5.

All of the current OLAP caching approaches ([1], [4], [11]) assume a demand
fetching strategy; we therefore view our work as complementary to these caching
algorithms.



3 Empirical Analysis of OLAP System Workloads 

In order for predictive prefetching techniques to be successful, the workload has to be 
navigational and the consideration time between two accesses must be long enough to 
allow for prefetching the results. The real-world system that has been investigated 
regarding these criteria is an SAP BW system supporting the distribution logistics of 
the material management division in a large chemical company. The interaction be- 
havior of 18 users was monitored over a two month period including 260 sessions 
containing 3150 queries. 




Fig. 2. Cumulative distributions of session length and consideration time 

In OLAP systems, the navigational nature of the workload is guaranteed as long as
the user interactively formulates his next request using the results of the previous
request ([9]). We call such a sequence of navigational queries a session. The analysis
showed that typical OLAP sessions have a considerable length and are thus suited for
prediction approaches. The left-hand side of Figure 2 shows the cumulative frequency
distribution of the session length. It shows that only 11% of the sessions consisted
of executing a single query (simple reporting). On the other hand, some of the
sessions contained more than 100 queries. If we assume that accurate prediction is
possible for sessions with 5 or more consecutive queries, Figure 2 shows that 63.8%
of the sessions fulfill this condition.

Additionally, the analysis showed that the consideration time between two queries
is long enough for a significant percentage of queries. The right-hand side of Figure 2
plots the cumulative frequency distribution of the consideration time against a
logarithmic scale. If we assume that the prediction and execution of a query takes 10
seconds (the median for query execution times is 9.4 seconds), over 81% of the queries
have a consideration time that is long enough for prefetching. These results are
confirmed when directly comparing consideration time and execution time for individual
queries: 7% of the queries in our scenario cannot benefit from prefetching because
they start a session. For 8% of the queries, the execution time of the next query is
larger than the consideration time of the user, forbidding a full prefetch (rapid
queries). The remaining 85% of the queries analyzed could potentially fully benefit from
predictive prefetching.
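The three-way classification used above can be illustrated with a short Python sketch. It assumes a log in which every query carries the consideration time that preceded it and its own execution time, both in seconds; the log format, the class labels as dictionary keys, and the sample values are assumptions made for illustration.

```python
from collections import Counter


def prefetch_classes(sessions):
    """Classify queries by whether a full prefetch would have been possible.

    sessions is a list of sessions; each session is a list of
    (consideration_time, execution_time) pairs, where the consideration
    time is the time the user spent before issuing that query.
    """
    counts = Counter()
    for session in sessions:
        for position, (consideration, execution) in enumerate(session):
            if position == 0:
                counts["session start"] += 1   # nothing to predict from
            elif execution > consideration:
                counts["rapid"] += 1           # no time for a full prefetch
            else:
                counts["prefetchable"] += 1
    return counts


sessions = [
    [(0, 9), (30, 9), (5, 12)],
    [(0, 4)],
]
classes = prefetch_classes(sessions)
```

In this toy log, two queries start a session, one arrives too quickly for a full prefetch, and one leaves enough time for it.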

4 Terminology and Basic Concepts for PROMISE 

This section introduces Markov Models (e.g. [6]) as a formalism to model interactive 
behavior. It also motivates which steps are necessary to adapt Markov Models to the 
OLAP domain. Furthermore, it presents the formal model for MD schemata and pres- 
ents the class of multidimensional queries considered for the PROMISE approach. 

Discrete time Markov Models (DTMM) represent a system as a finite set of states
$S = \{s_1, \ldots, s_n\}$. At each discrete point t of the (logical) timescale exactly one of
the states $s^t \in S$ is active. Between two points in time (t-1) and t, the system changes
its active state from $s^{t-1} \in S$ to $s^t \in S$. The probability that a specific state $s_i$
becomes the next active state is determined by a transition probability function P. The
transition probability depends only on a fixed number m of states (called the order of
the DTMM) that have been visited directly before the point t (i.e. the states
$s^{t-1}, s^{t-2}, \ldots, s^{t-m}$).

Definition (Order-m Markov Model): An order-m Markov Model is a tuple (S, P)
where S denotes a finite set of states and $P: S \times S^m \to [0;1]$ denotes the probability
function for state transitions. ♦

Designing a basic prediction algorithm based on a DTMM is rather straightforward.
During a session, the vector $v = (s^{t-m}, \ldots, s^{t-1})$ containing the last m states that
have been visited has to be recorded. The transition function $P(s^t, v)$ can then be used
to calculate the most probable states to be visited next. Normally each state transition
corresponds to an interaction (e.g. the activation of a WWW link). When applying this
to the OLAP domain, each state transition would be the execution of a multidimensional
query as the result of applying an MD operation. This naive approach has severe
technical and conceptual drawbacks:

Large State Space: Even if we restrict the class of queries, the number of queries that
can be executed against an MD schema is still orders of magnitude larger than e.g. the
number of documents of a WWW site. A large number of states leads to a large set of
potential successor states, which in turn decreases the probability of the transitions.
Therefore, this model is not only very costly to maintain but also not very predictive.

Higher Level Patterns cannot be exploited: The fine-grained approach only predicts
mere repetitive patterns that consist of exactly the same queries. This is unlikely for
OLAP applications, where e.g. the same sequence of query templates is executed with
different parameters.
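A basic predictor of this kind can be sketched in Python by estimating transition probabilities from observed sessions. This is a generic frequency-based order-m model for illustration, not the PROMISE algorithm; the class name, its methods, and the state labels are hypothetical.

```python
from collections import Counter, defaultdict


class MarkovPredictor:
    """Order-m Markov predictor with transition probabilities estimated
    from relative frequencies in observed sessions (order must be >= 1)."""

    def __init__(self, order):
        self.order = order
        self.counts = defaultdict(Counter)  # last m states -> next-state counts

    def train(self, session):
        m = self.order
        for i in range(m, len(session)):
            history = tuple(session[i - m:i])
            self.counts[history][session[i]] += 1

    def predict(self, history, top=3):
        """Return up to `top` (state, probability) pairs for the next state."""
        seen = self.counts.get(tuple(history[-self.order:]))
        if not seen:
            return []
        total = sum(seen.values())
        return [(state, count / total) for state, count in seen.most_common(top)]


predictor = MarkovPredictor(order=1)
predictor.train(["a", "b", "a", "b", "a", "c"])
prediction = predictor.predict(["a"])
```

The table of counts grows with the number of distinct histories, which is one way to see the state-space problem described above when each state is a concrete query.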

Section 6 will show how our approach tackles these problems. But first we have to
give a definition of the multidimensional data model, which we need later for the
definition of a canonical query. We only describe the most important elements of the
MD data model which are necessary to understand the prediction algorithm. A more
comprehensive version of this data model can be found in [8].

Definition (Dimension Schema): A dimension schema d is defined as a tuple
$d := (L_d, class)$ with $L_d$ being the set of level names. Each level $l \in L_d$ has a finite
domain dom(l) attached. Furthermore, every dimension schema contains a special level
<dimension_name>.all which contains only the special member <all>. The relationship
$class \subseteq L_d \times L_d$ defines the classification relationship between levels.
$(l_1, l_2) \in class^*$ reads "$l_1$ can be classified according to $l_2$." ♦

Example: The 'location' dimension of our example (cf. Fig. 1) can be formalized as
follows: $d_{location}$ := ( {location, type_of_unit, geogr_region, country, location.all},
{(location, type_of_unit), (location, geogr_region), (geogr_region, country),
(country, location.all), (type_of_unit, location.all)} ) ♦
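The example can be written down as plain Python data, together with a helper that computes $class^*$. The sketch assumes that $class^*$ denotes the reflexive-transitive closure of the classification relationship, which the text above does not state explicitly; the variable and function names are hypothetical.

```python
LEVELS = {"location", "type_of_unit", "geogr_region", "country", "location.all"}

CLASS = {
    ("location", "type_of_unit"),
    ("location", "geogr_region"),
    ("geogr_region", "country"),
    ("country", "location.all"),
    ("type_of_unit", "location.all"),
}


def closure(levels, relation):
    """Reflexive-transitive closure of a classification relationship
    (an assumed reading of class*)."""
    result = set(relation) | {(level, level) for level in levels}
    changed = True
    while changed:
        changed = False
        for (a, b) in list(result):
            for (c, d) in list(result):
                if b == c and (a, d) not in result:
                    result.add((a, d))
                    changed = True
    return result


CLASS_STAR = closure(LEVELS, CLASS)
```

With this closure, location can be classified according to country via geogr_region, whereas country cannot be classified according to geogr_region.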

The basic organizational element of an MD database is a multidimensional cube 
which contains the interesting measures. The schema of an n-dimensional cube is 
defined by combining n dimension schemata with a set of measures. 

Definition (Multidimensional Cube Schema): An n-dimensional cube schema $\Omega$ is
a tuple $\Omega = (D_\Omega, M_\Omega)$ where $D_\Omega = (d_1, \ldots, d_n)$ is a list of dimension schemata and
$M_\Omega$ is the set of measures of the cube. ♦

For notational convenience, we use the following conventions: dom(d) denotes the
domain of a dimension $d \in D_\Omega$. All elements $x \in dom(d)$ are called dimension members
of dimension d. Furthermore, $\Psi$ denotes the set of all levels of a cube schema $\Omega$.
Thus,

$$dom(d) = \bigcup_{l \in L_d} dom(l) \quad \text{and} \quad \Psi = \bigcup_{d \in D_\Omega} L_d .$$

The function level(x) returns the dimension level to which a dimension member
belongs, i.e.:

$$level(x) = l \iff x \in dom(l), \quad l \in \Psi, \quad x \in \bigcup_{d \in D_\Omega} dom(d).$$

The schema of a multidimensional database determines the queries that can be 
executed by the user. Our approach relies on a canonical representation of an OLAP 
query. Such a query is defined by the user selecting a set of interesting measures first 
(called result measures). Then, for each dimension of the cube schema $\Omega$, the user 
picks a restriction element (which may be the all-element representing no restriction). 
Implicitly, this defines a restriction on the base elements of the dimension. Figure 3 
visualizes the selection process for a two-dimensional data space. In dimension 1 
(left), the restriction element is on level $L_{13}$ of the dimension hierarchy. The structure 
of the result is defined by giving a result granularity for each dimension. This 
implicitly defines the necessary aggregations that have to be performed on the base data 
stored in the cube. For example, in Figure 3, for dimension 1 (left) the result 
granularity is $L_{12}$. The result of the query is shown on the right-hand side of Figure 3. 

Definition (Canonical OLAP Query): 

A canonical OLAP query $q_\Omega$ over an n-dimensional cube schema $\Omega = ((d_1, \ldots, d_n), M_\Omega)$ is 
defined as a 3-tuple $q_\Omega := (M, R, G)$ where $M \subseteq M_\Omega$ is the result measure set and 
$R = (r_1, r_2, \ldots, r_n)$ with $r_i \in dom(d_i)$, $1 \le i \le n$, is called the restriction vector of $q_\Omega$. 
The element $r_i$ is called the restriction element for dimension $d_i$. The tuple 
$G = (g_1, \ldots, g_n)$ with $g_i \in \Psi$ is called the result granularity vector of $q_\Omega$. In 
order for $q_\Omega$ to be valid, the restriction must not be finer than the result granularity, 
thus $(level(r_i), g_i) \in class^* \;\; \forall \, 1 \le i \le n$. For a cube schema $\Omega$ we denote the set of all 
possible canonical queries as $\Theta_\Omega$. ♦ 

Fig. 3. Visualization of a canonical query on a two-dimensional cube schema 

Example: An example query against the schema $\Omega_{repair}$ might be “Give me the 
number of repairs for the year ‘1999’ and the ‘steering’ assembly split up by the 
geographic regions of ‘Germany’”. 

$q_1$ := ( {#repairs}, (‘Germany’, ‘1999’, ‘all’, ‘all’, ‘steering’), 
(geogr_region, year, type.all, vehicle.all, assembly) ) ♦ 

5 Abstracting to Patterns 

This section presents different abstractions of canonical queries and argues why these 
are likely to produce more significant patterns for interactive data analysis 
applications. Mathematically speaking, the abstraction is defined by an equivalency relation 
$\alpha \subseteq \Theta_\Omega \times \Theta_\Omega$ on the set of all canonical queries $\Theta_\Omega$. (We denote the equivalency 
class of each query $q \in \Theta_\Omega$ as $p(q)$.) For PROMISE we distinguish structural and 
value-based patterns. 

Structural patterns mirror the fact that each stage of the analysis process has a set 
of characteristic views on the data. E.g., during the first phase of the analysis the 
logistics expert is looking for irregularities of failures according to different 
geographical regions. This means he compares the failure rate of different assemblies in the 
different regions. Thus, the current interest of the analyst can be derived from the 
structure of the query. First, it is important to note on which level of granularity the 
user formulates his queries (i.e. the granularity of the result G). A second hint to the 
user's intention is his choice of which dimensions are result dimensions (visualized e.g. 
as a table) and which dimensions are restricted to a single value (selection 
dimensions). The rationale behind this assumption is that if the user is analyzing 
dependencies between dimensions (e.g. geographic region and broken parts), the user will 
choose these dimensions as result dimensions. 



Definition (Selection Dimension): 

For a canonical OLAP query $q_\Omega = (M, (r_1, \ldots, r_n), (g_1, \ldots, g_n))$ against a cube schema 
$\Omega = ((d_1, \ldots, d_n), M_\Omega)$, we define the set of selection dimensions $\sigma_q$ as follows: 
$\sigma_q := \{ d_i \mid level(r_i) = g_i \}$. A dimension which is not a selection dimension is 
called result dimension. ♦ 



Following the above description of structural patterns, we define an abstraction of a 
query (called Structural Query Prototype) that only considers the measures contained 
in a query, the granularity of the result and which of the dimensions are selection 
dimensions and which are result dimensions. This prototype abstracts from the actual 
values used to restrict the query. 

Definition (Structural Query Prototype): 

Let us assume a query $q_\Omega = (M, R, G)$ with $\Omega = ((d_1, \ldots, d_n), M_\Omega)$ and its set of selection 
dimensions $\sigma_q$. The structural query prototype $p(q)$ for this query is defined as follows: 
$p(q) := (M, G, L = (L_1, \ldots, L_n))$ with $L_i = 1$ if $d_i \in \sigma_q$ and $L_i = 0$ if $d_i \notin \sigma_q$. 

We denote the set of all structural query prototypes for an MD schema $\Omega$ as $P_\Omega$. ♦ 



Example: The query $q_1$ = “Give me a ranking of garages in Germany according to the 
number of repairs summed up to the geographic region for 1998” and the query $q_2$ = 
“Give me a ranking of all garages according to the number of repairs summed up to the 
geographic region for 2000” both have the same prototype: $p_{q_1} = p_{q_2}$ = 

({#repairs}, (geogr_region, year, vehicle.all, type.all, part.all), (0,1,1,1,1)) ♦ 

Fig. 4. Query Prototype (labels: result measures: # of repairs; granularity of selection 
dimensions; granularity of result dimensions) 




Figure 4 shows a graphical representation of a structural query prototype for this 
example. With a view to greater clarity, we omit the ‘all’-levels from the enumeration 
of selection levels. 
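
The prototype of this example can be reproduced with a few lines of code. This is a minimal sketch under stated assumptions: each restriction element is stored together with its level as a (level, member) pair, so that level(r) is a simple lookup, and the names Query and prototype are illustrative rather than part of PROMISE.

```python
from typing import FrozenSet, NamedTuple, Tuple

class Query(NamedTuple):
    measures: FrozenSet[str]
    restriction: Tuple[Tuple[str, str], ...]  # (level, member) per dimension
    granularity: Tuple[str, ...]              # result level per dimension

def prototype(q: Query):
    """Structural query prototype (M, G, L): L_i is 1 for selection dimensions."""
    flags = tuple(
        int(level == g)  # selection dimension iff level(r_i) = g_i
        for (level, _member), g in zip(q.restriction, q.granularity)
    )
    return (q.measures, q.granularity, flags)

GRAN = ("geogr_region", "year", "vehicle.all", "type.all", "part.all")
q1 = Query(
    frozenset({"#repairs"}),
    (("country", "Germany"), ("year", "1998"),
     ("vehicle.all", "all"), ("type.all", "all"), ("part.all", "all")),
    GRAN,
)
q2 = Query(
    frozenset({"#repairs"}),
    (("location.all", "all"), ("year", "2000"),
     ("vehicle.all", "all"), ("type.all", "all"), ("part.all", "all")),
    GRAN,
)

print(prototype(q1) == prototype(q2))  # True
print(prototype(q1)[2])                # (0, 1, 1, 1, 1)
```

The two queries differ only in their restriction values, which the prototype deliberately ignores.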

Value-based patterns model the fact that the analyst often accesses values of a di- 
mension in a certain order which may be problem specific. E.g. the analyst may start 
the analysis with his own region and then analyze the adjoining regions. In order to 
describe these value-based patterns between successive queries, we define a value- 
based equivalency relation. As these patterns are specific for a single dimension, the 
equivalency relation is defined dimension specific: 



Definition (Value Based Abstraction): 

Assuming two queries $q = (M, (r_1, \ldots, r_n), G)$ and $q' = (M, (r'_1, \ldots, r'_n), G)$, the 
value-based equivalency relation $\alpha_V^i$ is defined as: $q \; \alpha_V^i \; q' :\Leftrightarrow r_i = r'_i$ ♦ 



6 The PROMISE Prediction Algorithm 

The previous section presented different abstract views on the interaction sequences. 
The purpose of the abstraction was to suppress detail information about a query in 
order to produce more significant, higher level patterns. The problem for prediction is 
that a single prediction model which is built only on the abstract view can only make 
abstract predictions. This section shows how to overcome this problem by using a set 
of prediction models and exploiting the similarity of two subsequent queries. 

Summarizing our considerations presented so far, we base the prediction model on 
the following central assumptions: 

1. The structural patterns are independent of the restriction values (i.e. the probability 
that the next query has a certain structural prototype, is not dependent on the cur- 
rent restriction elements). 

2. Patterns for the restriction values in the different dimensions are independent of 
each other (i.e. the probability of selecting e.g. ‘1999’ next is not dependent on the 
current restriction e.g. in the location dimension). 

3. The probability of which restriction elements change between two queries is 
dependent on the structural prototypes of the two queries. 

Assumptions (1) and (2) allow us to use (n+1) independent Markov Models (one 
Structural Model and one Value-Based Model for each dimension) for the prediction. 
Assumption (3) is a refinement of the separation to increase the accuracy of the 
prediction (see below). 




Fig. 5. A sample Structural Prediction Model (SPM) 

The state space of the Structural Prediction Model (SPM) is defined by the set of 
structural prototypes of a model. A transition from state $p_1$ to $p_2$ is labeled with the 
probability that a query $q_1$ with prototype $p_1 = p(q_1)$ is followed by a query $q_2$ with 
prototype $p_2 = p(q_2)$. Figure 5 shows an extract of the structural prediction model for 
the example schema $\Omega_{repair}$. The SPM mirrors the macro structure of the analysis process. 
It can be used to predict the selection levels, the result granularity and the result 
measures of the next query. As it does not contain any information about values, it 
cannot predict the restriction elements. 
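
A first-order Markov model of this kind can be estimated from an observed sequence of states by counting transitions. The sketch below assumes simple relative-frequency estimation, which is one common choice and not necessarily the procedure used in PROMISE; the state names p1, p2, p3 and the session log are hypothetical.

```python
from collections import Counter, defaultdict

def fit_markov(sequence):
    """Estimate transition probabilities from a sequence of observed states."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return {
        state: {nxt: c / sum(successors.values()) for nxt, c in successors.items()}
        for state, successors in counts.items()
    }

# Hypothetical session: the structural prototype of each successive query.
log = ["p1", "p1", "p2", "p1", "p1", "p3"]
spm = fit_markov(log)
print(spm["p1"])  # {'p1': 0.5, 'p2': 0.25, 'p3': 0.25}
print(spm["p2"])  # {'p1': 1.0}
```

The same function applies to any state space, so it could equally be fed a sequence of dimension members instead of prototypes.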

For the prediction of restriction values, we introduce a Value-Based Prediction 
Model (VPM) for each dimension. This model contains information about typical 
navigation paths in a dimension. The states are labeled with the dimension members 
and transitions from state a to state b represent a sequence of queries q, p where q was 
restricted by a and p was restricted by b. Figure 6 (left) shows a sample VPM for the 
‘location’ dimension. As stated in assumption (1), the VPM is independent of the 
SPM and is used to predict restriction values for the next query. 




Fig. 6. Examples for a VPM (left) and a probabilistic change vector (right) 

So far, we can already predict a complete next query using the combination of SPM 
(result measures, result granularity) and VPM (restriction elements). The accuracy of 
the prediction can be increased by taking into account assumption (3). It is based on 
the observation that typically not all of the restriction values are changed between 
two subsequent queries. Which dimensions change restriction elements is often de- 
pendent on the current state of the analysis process (i.e. the structure of the last 
query). Therefore, we extend the SPM to additionally record the probability for the 
different restriction values to change (called probabilistic change vector) for every 
transition. For example, Figure 6 (right) shows the probabilistic change vector for a 
transition of the sample SPM. The probability that the restriction element for dimen- 
sion 1 (location) changes between two subsequent queries with a structure corre- 
sponding to state (1) is 10%. This information is used to further restrict the set of 
candidate restriction values for the dimension in the following way: If the probabilis- 
tic change vector indicates a probability that is below a certain threshold (e.g. 15%), it 
is assumed that the restriction value for this dimension does not change. The follow- 
ing formal definition summarizes the above considerations: 

Definition (Prediction Profile): A Prediction Profile $P_\Omega$ for an n-dimensional cube 
schema $\Omega$ is a 3-tuple (SPM, $\omega$, VPM) with SPM being a Markov Model containing 
the structural prototypes of the cube schema as states. SPM = $(\{ p(q) \mid q \in \Theta_\Omega \}, P_S)$ 
is called the structural prediction model. The function $\omega: P_\Omega \times P_\Omega \to [0,1]^n$ assigns a 
probabilistic change vector to every transition of the structural prediction model SPM. 
VPM = $(vpm_1, \ldots, vpm_n)$ is a set of n Markov Models. Each $vpm_i = (\{ r \mid r \in dom(d_i) \}, P_{V_i})$, 
$i \in [1;n]$, is a Markov Model based on the values of dimension $d_i$ and is called the value 
prediction model for dimension $d_i$. ♦ 

Prediction Profiles can either be built by domain experts during a conceptual model- 
ing process ([8]) or can be induced from query log files that contain the queries that 
were executed in the past. The prediction algorithm uses such a prediction profile in 
the following way: We assume that the system keeps track of the last query of the 
user’s session $q^{t-1} = (M^{t-1}, R^{t-1}, G^{t-1})$. The algorithm computes a set of candidates for the 
query $q^t = (M^t, R^t, G^t)$. 
First, we use the SPM together with $q^{t-1}$ to predict the granularity vector $G^t$ and the 
result measures $M^t$ of the next query. In order to do sensible prefetching, we 
additionally need to predict the selection vector $R^t$. The prediction of $R^t$ consists of two 
phases. For the selection dimensions of $q^t$, $r_i^t$ can be derived from the probabilistic 
change vector $\omega(p(q^{t-1}), p(q^t))$ in the following way: if the probability for a value 
change in the i-th dimension ($\omega_i$) is below a certain threshold, $r_i^t$ is assumed to be equal 
to $r_i^{t-1}$. If $\omega_i$ is higher, the prediction model $VPM_i$ is used to predict the next 
selection value. If $d_i$ is predicted to be a selection dimension in $q^t$, we additionally 
know which granularity level the next restriction element will belong to, as the level 
of selection is equal to the granularity of the result $g_i^t$. This additional information is 
used to provide a better prediction, as the potential successor states in $VPM_i$ are 
restricted to values of $dom(g_i^t)$. 

Example: Let query $q^{t-1}$ be the query shown in the example in section 4.3 (this 
query corresponds to state (1) in Figure 5). The prototype predicted by the SPM (with 
a confidence of 40%) is $p(q^t)$ = ({#repairs}, (geogr_region, year, vehicle.all, type.all, 
assembly), (0,1,1,1,1)) which also corresponds to state (1) of the SPM. If we assume 
a value change threshold of 0.15, the restriction elements of dimensions 1 (location), 
3 (vehicle) and 4 (type) are assumed not to change (cf. Figure 6) while the VPMs are 
consulted for the selection values $s_2^t$ (time) and $s_5^t$ (part). Being selection dimensions, 
in both cases it is known that $s_2^t$ belongs to the level year and $s_5^t$ belongs to assembly. 
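
The threshold rule of this example can be sketched in code. The following is only an illustration under several assumptions: apart from the 10% change probability for dimension 1 and the 0.15 threshold, all numbers are hypothetical; each VPM is a plain dictionary of transition probabilities; and the sketch returns the single most likely successor rather than a set of candidates. The restriction of successors to dom(g_i) is omitted for brevity.

```python
def predict_restrictions(last_restriction, change_vector, vpms, threshold=0.15):
    """Predict the next restriction vector from the previous one."""
    predicted = []
    for r_prev, omega_i, vpm in zip(last_restriction, change_vector, vpms):
        if omega_i < threshold:
            # Change is unlikely: keep the previous restriction element.
            predicted.append(r_prev)
        else:
            # Change is likely: take the most probable successor in the VPM,
            # falling back to the old value if the VPM knows no successor.
            successors = vpm.get(r_prev, {})
            predicted.append(max(successors, key=successors.get) if successors else r_prev)
    return tuple(predicted)

last = ("Germany", "1999", "all", "all", "steering")
omega = (0.10, 0.60, 0.05, 0.05, 0.40)  # hypothetical except the first entry
vpms = [
    {},                                              # location (not consulted)
    {"1999": {"2000": 0.7, "1998": 0.3}},            # time, hypothetical
    {},                                              # vehicle (not consulted)
    {},                                              # type (not consulted)
    {"steering": {"brakes": 0.6, "engine": 0.4}},    # part, hypothetical
]
print(predict_restrictions(last, omega, vpms))
# ('Germany', '2000', 'all', 'all', 'brakes')
```

Only dimensions 2 and 5 exceed the threshold, so only their value models are consulted, as in the example above.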

References 



[1] J. Albrecht, A. Bauer, O. Deyerling, H. Günzel, W. Hummer, W. Lehner, L. Schlesinger: 
Management of Multidimensional Aggregates for Efficient Online Analytical Processing. 
Proc. of the IDEAS, 1999. 

[2] A. Bestavros: Speculative Data Dissemination and Service, in Proc. of the 12th International 
Conference on Data Engineering (ICDE), 1996. 

[3] K.M. Curewitz, P. Krishnan, J.S. Vitter: Practical Prefetching via Data Compression, in 
Proc. of the SIGMOD ‘93, ACM Press, 1993. 

[4] P. Deshpande, K. Ramasamy, A. Shukla, J.F. Naughton: Caching Multidimensional Queries 
Using Chunks. In Proc. ACM SIGMOD Conference, pp. 259-270, 1998. 

[5] J. Griffioen, R. Appleton: The Design, Implementation, and Evaluation of a Predictive 
Caching File System, Technical Report No. CS-264-96, University of Kentucky, 1996. 

[6] R.A. Howard: Dynamic Programming and Markov Processes, John Wiley, 1960. 

[7] V. N. Padmanabhan, J. C. Mogul: Using Predictive Prefetching to Improve World Wide Web 
Latency. In Proc. of ACM SIGComm ’96, pp. 26-36, 1996. 

[8] C. Sapia: On Modeling and Predicting User Behavior in OLAP Systems, Proceedings of the 
DMDW99 Workshop, CAiSE Conference, 1999. 

[9] C. Sapia: PROMISE - Modeling and Predicting User Query Behavior in Online Analytical 
Processing Environments, FORWISS Technical Report FR-2000-001, June 2000. 
http://www.forwiss.de/public/reports.html. 

[10] C. Sapia, M. Blaschka, G. Hofling, B. Dinter: Extending the E/R Model for the 
Multidimensional Paradigm, in "Advances in Database Technologies", LNCS Vol. 1552, 1999. 

[11] J. Shim, P. Scheuermann, R. Vingralek: Dynamic Caching of Query Results for Decision 
Support Systems, in Proc. of the SSDBM Conference, 1999. 




Supporting Online Queries in ROLAP * 



Daniel Barbara Xintao Wu 



George Mason University 

Information and Software Engineering Department 
Fairfax, VA 22303 
{dbarbara, xwu}@gmu.edu 



Abstract. Data warehouses are becoming a powerful tool to analyze enterprise 
data. A critical demand imposed by the users of data warehouses 
is that the time to get an answer (latency) after posing a query be as 
short as possible. It is arguable that a quick, albeit approximate, answer 
that can be refined over time is much better than a perfect answer for 
which a user has to wait a long time. In this paper we address the issue 
of online support for data warehouse queries, meaning the ability to reduce 
the latency of the answer at the expense of having an approximate 
answer that can be refined as the user is looking at it. Previous work 
has addressed online support by using sampling techniques. We argue 
that a better way is to preclassify the cells of the data cube into error 
bins and bring the target data for a query in “waves,” i.e., by fetching 
the data in those bins one after the other. The cells are classified into 
bins by means of a data model (e.g., linear regression, loglinear 
models) that allows the system to obtain an approximate value for 
each of the data cube cells. The difference between the estimated value 
and the true value is the estimation error, and its magnitude determines 
to which bin the cell belongs. The estimated value given by the model 
serves to give a very quick, yet approximate, answer that will be refined 
online by bringing cells from the error bins. Experiments show that this 
technique is a good way to support online aggregation. 
1 Introduction 

A data cube is a popular organization for summary data [12]. A cube is simply 
a multidimensional structure that contains at each point an aggregate value, 
i.e., the result of applying an aggregate function to an underlying relation. For 
instance, a cube can summarize sales data for a corporation, with dimensions 
“time of sale,” “location of sale” and “product type”. 

The underlying relation that serves as the base for the cube creation is 
commonly referred to as the fact table in the star schema [16]. The fact table contains 
all the attributes that determine the dimensions of the cube, plus an attribute 
on which the aggregations are to be performed. Along with this table, there are 
dimension tables whose keys are foreign keys of the fact table and which 
describe each one of the dimension attributes. The combination of fact table and 
dimension tables forms what is called a star schema. In our example, the fact 
table would contain the individual sale tuples, stating the time of sale, location, 
product and amount. The last attribute (amount) is the one on which the 
aggregations are to be performed (to find out, for instance, the total sales amount 
for a particular location, during the month of January, 2000). 

A data cube is composed of a series of aggregation levels that form a lattice 
(see [12, 13]). The basic level, the core cuboid, contains aggregations at the finest level 
(i.e., where each dimension of the cube is represented by a value). For instance, 
in our example the core cuboid would contain one cell for each combination 
of time, location and product type. The value contained in that cell would be 
the aggregate of each individual tuple in the fact table whose attributes time, 
location, and product type are precisely those of that cell. 

Each tuple in the fact table contributes to one cell of the core cuboid. How- 
ever, not all cells in the core cuboid have corresponding tuples in the fact table. 
There exist (usually many) cells whose values are null, simply because there is 
no tuple in the fact table that exhibits their combination of attributes. In our 
example, there will be combinations of time, location and product type for which 
no tuple exists, and therefore the corresponding cell value is null. Coarser 
aggregated cells (e.g., time and location for all products) form other cuboids in the 
data cube. 

Multidimensional data bases are a central concept in the field commonly 
known as On-Line Analytical Processing (OLAP). OLAP systems are typically 
implemented either directly on top of relational systems or as a combination of a 
relational system and a proprietary multi-dimensional database (MOLAP) [11]. 
In both approaches the underlying database is stored in a relational system. In 
a fully relational implementation (ROLAP), data cube queries are transparently 
converted into queries on the underlying relational database. Clever indexing 
schemes and careful query optimization are used to improve the performance 
of these queries. In the MOLAP approach, the cube data is extracted from the 
database, converted into multidimensional arrays, and clustered so that common 
cube queries require minimal I/O. 

Queries (mostly aggregations) in OLAP are performed in a mode that mostly 
resembles a batch operation. A query is posed, the system processes it (usually 
for a long period of time) and the answer is then returned to the user. This is 
particularly unfortunate in the OLAP environment, since OLAP is intended to 
serve as a decision support tool. The analyst is intent on finding facts about the 
organization, and more often than not, the results of the query drive the next 
query the analyst wants to ask. Waiting for a long time for an answer breaks 
the train of thought and is likely to render the system much less appealing for 
users. The analyst would be better served by a quick, yet approximate answer 
that gets refined as she is thinking about the significance of the data. 

To that purpose, Hellerstein, Haas and Wang [15] have proposed and imple- 
mented a technique that allows users to observe the progress of the query and 
obtain a quick, approximate answer that gets refined over time. At any point in 
time, an error estimate is offered to the user. The user controls an interface that 
permits stopping the processing of the query at any point where she is satisfied 
with the answer. The authors do this by performing a running aggregate, re- 
trieving records in random order and estimating the proximity of the aggregate 
to the final result as a running confidence interval. A follow-up paper [14] aims 
to optimize this technique by using ripple joins, designed to minimize the time 
until an acceptably precise estimate of the query result is available. 

There are, however, several problems with the approach taken by [15], as 
pointed out in [1]. The use of sampling to estimate the output of a join can 
produce a poor-quality approximation, due to the non-uniform result sample 
(the join of two uniform samples is not a uniform random sample of the output 
of the join), and the size of the join result (the join of two random samples has 
very few tuples, leading to inaccurate answers and poor confidence bounds). The 
authors of [1] propose to remedy these problems by using precomputed samples 
of a small set of distinguished joins that serve as a basis to answer queries. 
However, it is not clear how such a solution can be used to implement a system 
that refines the answer as time goes by. 

In this paper, we present an alternative method for supporting on-line ag- 
gregations in ROLAP. Our technique relies on the characterization of chunks of 
the data cube via statistical (or other type of) models. Partitioning the cube 
and fitting a model to each of the resulting chunks gives a way of estimating 
the values of the cells. Of course, these estimations are never perfect. However, 
some cells are better fitted by the model than others. Our technique classifies the 
tuples in the fact table in error bins, according to how far the estimates of their 
corresponding cells in the cube are from the real values of the tuples. Having 
done this, queries are processed by first using the models to estimate the values 
of the target cells and then retrieving, one after another, the sets of target tuples in 
each error bin, beginning with the bin that corresponds to the highest error level 
and following the bins in descending order of error. By doing so, the system 
can offer a quick, albeit approximate, answer based on the estimates, which can 
be refined over time as tuples from the different error bins are processed. Since 
the errors in the bins have been precomputed, there is no need to 
estimate the confidence level. 

This paper is organized as follows. Section 2 describes our technique. Section 
3 presents an architecture of the system. Section 4 shows the results of the 
experimental evaluation. Finally, Section 5 offers conclusions and avenues for future 
work. 



2 Our Technique 

The basis for our technique is the concept of the Quasi-Cube, formulated 
previously by us in [6, 7, 9] as a way to achieve cube compression. Quasi-Cubes are 
the only cube compression technique that allows the designer to establish 
guarantees for the errors incurred in the answers in such a way that the errors 
are kept below a predefined threshold, regardless of the distribution of the data. 
All the other techniques [6, 7, 9, 17, 19] can guarantee error levels, but the errors 
vary with the underlying data distributions and cannot be fixed by the designer. 

In the Quasi-Cube, we model regions of the core cuboid and employ these 
models to estimate the values of the individual cells. The reason to focus on the 
core cuboid is simple: the error guarantees for queries to the core cuboid hold for 
any other cuboid in the lattice. (In practice, the errors incurred when aggregating 
estimated cells of the core cuboid decrease dramatically because they tend to 
cancel each other.) We achieve fixed levels of error by retaining cells of the cube 
whose values, when recreated by the decompression algorithms, differ from the 
real values by more than the tolerated threshold. 

In this paper, however, we use the Quasi-Cube idea for a different purpose. 
Once the modeling of the regions is finished, we classify and place each tuple in 
the fact table in an error bin. An error bin is a repository of tuples from the fact 
table that contribute to cells in the core cuboid whose estimated values differ 
from their real values by an amount that falls within the range, expressed as 
error percentage, assigned to that error bin. The whole set of error bins covers 
the entire range of error percentages, with the last one being associated with 
an open interval (greater than a certain error percentage value). The number of 
bins, as well as the error ranges associated with the bins are set by the designer 
before the tuples get placed in them. 
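
The mapping from an estimation error to an error bin can be sketched as follows. The bin boundaries below (5%, 10%, 20%) are hypothetical designer choices, and the sketch assumes that the error percentage is taken relative to the real cell value; cells with a real value of zero are sent to the open-ended highest-error bin, which is also an assumption of this sketch.

```python
from bisect import bisect_right

# Hypothetical upper bounds (in percent) chosen by the designer.
# Bin 0: [0, 5), bin 1: [5, 10), bin 2: [10, 20), bin 3: 20 and above (open interval).
BOUNDS = [5.0, 10.0, 20.0]

def error_bin(estimate, actual, bounds=BOUNDS):
    """Return the index of the error bin for a cell, given its estimate and real value."""
    if actual == 0:
        return len(bounds)  # relative error undefined: use the highest-error bin
    error_pct = abs(estimate - actual) / abs(actual) * 100.0
    return bisect_right(bounds, error_pct)

print(error_bin(103.0, 100.0))  # 0
print(error_bin(88.0, 100.0))   # 2
print(error_bin(150.0, 100.0))  # 3
```

With this numbering, the last index is the highest-error bin, which is the first one fetched during query processing.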

In what remains of this section, we explain the issues and techniques behind 
each step of our idea. 



2.1 Chunking and modeling 

The first step is to select chunks of the core cuboid that will be modeled. Of 
course, this assumes that we have computed the core cuboid, from the fact and 
dimension tables. There are efficient algorithms to compute data cubes (e.g., 
[18, 2]); however, since we are only interested in computing the core cuboid, it 
is sufficient to sort the fact table and do a pass over the sorted data to compute 
aggregates. 
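
The sort-and-scan computation of the core cuboid can be illustrated in a few lines. This is a minimal in-memory sketch, assuming SUM as the aggregate function and fact rows given as tuples whose last field is the measure; the sample rows are hypothetical.

```python
from itertools import groupby

def core_cuboid(fact_rows, n_dims):
    """Sort the fact rows by their dimension values, then aggregate in one pass."""
    key = lambda row: row[:n_dims]
    cells = {}
    for coords, group in groupby(sorted(fact_rows, key=key), key=key):
        cells[coords] = sum(row[n_dims] for row in group)  # SUM is an assumption
    return cells

# Hypothetical fact table: (time, location, product type, amount).
facts = [
    ("jan", "fairfax", "tv", 100.0),
    ("feb", "fairfax", "tv", 20.0),
    ("jan", "fairfax", "tv", 50.0),
]
print(core_cuboid(facts, 3))
# {('feb', 'fairfax', 'tv'): 20.0, ('jan', 'fairfax', 'tv'): 150.0}
```

Combinations of dimension values that never occur in the fact table simply do not appear in the result, which corresponds to the null cells of the core cuboid.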

In dividing the core cuboid, the target is to find chunks that are dense enough, 
so the models have a better chance of fitting the data. A simple way of doing 
this is to divide the core cuboid space with a regular grid, identifying the chunks 
that have potential for being modeled, either because they are dense enough or 
because sub-chunks contained inside them are dense enough. A pass over the core 
cuboid data is required to produce the chunks. (We are currently experimenting 
with techniques to “re-shuffle” the order of the rows in some (or all) dimensions 
to increase the density of regions.) 

At the end of this step, we have classified the chunks into three types: Dense 
chunks, which are candidates for modeling; Empty chunks, i.e., sub-spaces that 
do not contain any data cells and can be discarded; Sparse chunks, which contain 
very few cells. Only the dense chunks are modeled. Computing the models is a 
step whose implementation depends, of course, on the choice of models. Many 
different models can be used for cube compression [5] and can be used to im- 
plement on-line aggregation. Details of how to efficiently model chunks of data 
using linear regression can be found in [7]; for loglinear models, see [9]; for kernel 
estimations see [17], and for wavelets [19]. The choice of the model is orthogonal 
to our technique, affecting only the distribution of tuples among the 
error bins. 
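
The regular-grid classification of chunks can be sketched as below. The sketch assumes integer cell coordinates, a fixed chunk shape, and a single density threshold, all of which are illustrative choices; it does not implement the refinement in which a chunk qualifies because one of its sub-chunks is dense. Empty chunks are exactly those grid positions that never appear in the result.

```python
from collections import Counter
from math import prod

def classify_chunks(cell_coords, chunk_shape, dense_threshold):
    """Label each non-empty grid chunk as 'dense' or 'sparse' by its fill ratio."""
    counts = Counter(
        tuple(c // s for c, s in zip(coords, chunk_shape))  # chunk index of the cell
        for coords in cell_coords
    )
    capacity = prod(chunk_shape)  # number of cell positions per chunk
    return {
        chunk: "dense" if n / capacity >= dense_threshold else "sparse"
        for chunk, n in counts.items()
    }

# Hypothetical non-null cells of a two-dimensional core cuboid.
cells = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(classify_chunks(cells, chunk_shape=(2, 2), dense_threshold=0.5))
# {(0, 0): 'dense', (2, 2): 'sparse'}
```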

To illustrate the meaning of models, we present a linear regression model 
[20] for two dimensions, shown in Equation 1. Each estimated cell value $\hat{y}_{ij}$ is 
composed of three terms, where $c_j$ is the marginal sum for column j 
and $r_i$ is the marginal sum for row i. (The marginal sum is the sum of the values 
belonging to the particular row or column.) With this method, each cell value 
would be described by three values: $b_0$, $b_1$, $b_2$. Notice that we build one such 
model for each (dense) chunk in the core cuboid. 

$$\hat{y}_{ij} = b_0 c_j + b_1 r_i + b_2 \qquad (1)$$ 

The difference between the estimated value $\hat{y}_{ij}$ and the real value of the cell, 
$y_{ij}$, gives us the estimation error, which can be expressed as a percentage of the 
cell value and used to classify the cell into one of the error bins. 
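
Equation 1 can be evaluated for a small chunk as follows. In this sketch the coefficients are hand-picked, hypothetical values (no fitting is performed), the chunk is a dense two-dimensional list with non-zero cells, and the error percentage is taken relative to the real value, which is an assumption.

```python
def estimate_chunk(chunk, b0, b1, b2):
    """Apply Equation 1: estimate = b0 * column_sum + b1 * row_sum + b2."""
    row_sums = [sum(row) for row in chunk]
    col_sums = [sum(col) for col in zip(*chunk)]
    return [
        [b0 * col_sums[j] + b1 * row_sums[i] + b2 for j in range(len(col_sums))]
        for i in range(len(row_sums))
    ]

def percent_errors(chunk, estimates):
    """Estimation error of each cell as a percentage of its real value."""
    return [
        [abs(e - y) / abs(y) * 100.0 for e, y in zip(est_row, row)]
        for est_row, row in zip(estimates, chunk)
    ]

chunk = [[1.0, 2.0],
         [3.0, 4.0]]
# Hypothetical coefficients that happen to fit this additive chunk exactly.
est = estimate_chunk(chunk, b0=0.5, b1=0.5, b2=-2.5)
print(est)                         # [[1.0, 2.0], [3.0, 4.0]]
print(percent_errors(chunk, est))  # [[0.0, 0.0], [0.0, 0.0]]
```

On real data the fit is rarely exact, and the resulting percentages are what decide the error bin of each cell.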

It is worth pointing out that the modeling of the chunks requires no additional 
pass over the core cuboid, since the models can be computed as soon as the 
chunk is deemed dense enough, and while the data is still in main memory. This, 
of course, assumes that the cells in the chunk fit in memory. If that is not the 
case (i.e., we are in the presence of a big chunk), then some I/O is required 
to drive the modeling process. Most likely, though, the modeling step will be 
CPU intensive, and its running time will be determined by the complexity of 
the models we are trying to build. 

The dense chunk descriptions are organized as blocks of data containing the 
model description, parameter values, and index (we index either non-empty cells 
or empty cells, depending on which set is smaller). These descriptions are smaller 
than the data they characterize and thus, many of them can be brought to main 
memory at once. Indexing of the chunks can help in quickly finding which chunks 
are needed to answer a query. 
2.2 Placing tuples into error bins 

As the chunks are being modeled, the error between the estimated value of 
a cell and its real value can be computed. Thus, the cell can be classified as 
belonging to an error bin. (The error ranges of the error bins have been previously 
established.) However, since we do not put cells into error bins, but rather tuples 
(as we are interested in supporting online aggregation in ROLAP), we still have 
to find the tuples that contribute to that particular cell in the core cuboid 
(recall that each tuple in the fact table contributes only to one cell in the core 
cuboid). We do that by querying the fact table to find the tuples that match the 
coordinates of the particular cell. Once these tuples are found, they are written 
to the table associated with the error bin into which the cell has been classified. 
(The error bin is implemented by a relational table that contains the tuples 
belonging to that bin.) Tuples corresponding to cells in the sparse chunks are 
placed directly into the highest-error bin. 

At the end of this process, we are left with a set of relational tables, each 
representing an error bin. (Each containing tuples that contribute to cells whose 
estimation error lies within the range of the corresponding bin.) 

It is important to provide a way to add new tuples to the fact table after the 
original set of tuples has been placed in the error bins. In our technique this can 
be accomplished in a simple manner, but we leave the details out of this paper 
for lack of space (they can be found in [9]). 



2.3 Query processing 

Given a query posed by a user, the evaluation plan is as follows. 

— First we identify the chunks that contain target cells for the query, and use 
the models to estimate the values of these cells. At the same time (concurrently) 
we issue a request to the data base engine to fetch the target tuples 
in the first (highest-error) bin. The first estimate of the answer can be computed 
with the model estimates. This is true for all cells with the exception 
of those that belong to sparse chunks, whose tuples have been stored in the 
highest-error bin. This fact argues for making the density threshold that decides 
which chunks are dense enough as small as possible. However, we have 
to keep in mind that better models are found when the density is higher. 

— After the target tuples are retrieved from the highest-error bin, they are 
aggregated according to the query and combined with the previous answer to 
give a refined view of the answer to the user. Then more tuples are requested 
from the next bin (in descending order of error), until either the user decides 
to stop the process or all the data has been processed. 
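
The two steps above can be sketched as a Python generator that yields one refined answer per processed bin. This is only an illustration under stated assumptions: the query is a SUM over a set of target cells, the bins are in-memory lists rather than relations, and the way estimates are replaced by exact values is our reading of "combined with the previous answer". The name `online_sum` is hypothetical.

```python
def online_sum(target_cells, estimates, bins):
    """Yield successively refined answers to a SUM query.

    target_cells: set of cell coordinates selected by the query
    estimates:    dict mapping cell coords -> model estimate
    bins:         list of bins (highest error first), each a list of
                  (coords, measure) tuples
    """
    # Current value per target cell; starts as the model estimate.
    # Cells without a model (sparse chunks) start at 0 until bin 0 is read.
    current = {c: estimates.get(c, 0.0) for c in target_cells}
    yield sum(current.values())  # first answer, from the models only

    for tuples in bins:
        exact = {}
        for coords, measure in tuples:
            if coords in target_cells:
                exact[coords] = exact.get(coords, 0.0) + measure
        # All tuples of a cell live in the same bin, so after this bin the
        # cells found in it are exact and replace their estimates.
        current.update(exact)
        yield sum(current.values())
```

A caller can stop iterating at any point, which corresponds to the user interrupting the refinement.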

Notice that the query processing of the bins requires a data base engine, 
which could be proprietary or an off-the-shelf commercial data base. The bins 
are simply relations over which the query is posed. Figure 1 shows a diagram of 
the architecture of the system. 

Figure 1: Architecture of the online aggregation system. 



3 Experimental Results 

Our results were obtained using a real data set which contains data from a 
survey of retailers. From each tuple of the dataset seven attributes were used 
(the rest of the data in the tuple is textual). The first six are dimension attributes 
(such as product, store) and the seventh one is the aggregate attribute (i.e., sale 
amount). The corresponding cardinalities of the attributes are 597, 42, 966, 
1764, 4, 69. The size of each tuple is 102 bytes and the data set has a total 
of 764,993 tuples (i.e., 78,029,286 bytes). The complete cube for this data set 
would have 11,792,568,586,176 cells, but since the data is sparse only 335,324 
cells are non-zero. We set up six error bins whose error ranges were (in descending 
order): > 50%, [40%, 50%), [30%, 40%), [20%, 30%), [10%, 20%), < 10%. We used 
Oracle8 Personal Edition as the data base engine, and the experiment was run 
on a machine with a Pentium CPU running at 300 MHz. The cube was divided 
into equally sized chunks. All the chunks were considered dense (no sparse chunks 
were kept). 

Figure 2 shows the time needed to populate the error bins, T_error, using the 
retailer data set. This time includes building the core cuboid, dividing the cube 
into chunks, modeling the chunks and populating the tables. The third column lists 
the time to build the core cuboid, T_core (without modeling regions). The first 
row in the table corresponds to the experiment we ran using the retailer data 
set. For the second row, we took only a subset of the data set, corresponding 
to the first value of attribute 1 and built a smaller Quasi-Cube (and cube) of 
the dimensions shown in the table. The overhead imposed by the regression 
method, although considerable (61:1) is manageable: a Quasi-Cube of the larger 
size (597 x 42 x 966 x 1764 x 4 x 69) takes approximately 1.5 hours to be built 
in its entirety. We also see that the time to build Quasi-Cubes scales well with 
the size and dimensions of the dataset. The time to build the small Quasi-Cube 
is not exactly 1/597 of the time needed to build the larger one, simply because 
the region we chose contains fewer than the average number of non-zero cells of 
regions in the larger cube (286 non-zero cells were present in the smaller cube, 
while the average per region in the larger cube is 561.68). The process of building 
Quasi-Cubes is CPU bound. Only 45.22 seconds were spent waiting for I/O when 
building the large Quasi-Cube. The system does not require any more I/O than 
is necessary to bring the relation (in steps) into memory. Figure 3 shows the sizes 
of the error bins and the entire relation. 



Dimensions                        T_error      T_core 
597 x 42 x 966 x 1764 x 4 x 69    4,580 sec    74.84 sec 
1 x 42 x 966 x 1764 x 4 x 69      6.0 sec      0.1 sec 



Figure 2: Time to populate error bins and build core cuboid 



Error bin    1        2        3         4        5         6         Fact table 
Tuples       94,585   86,556   137,136   95,947   112,438   238,337   764,993 



Figure 3: Sizes of the error bins and fact table. 



Figure 4 shows the times needed to retrieve the target data from each one of 
the error bins when using online aggregation and the time to retrieve the tuples 
from a single fact table, with no online aggregation support. The first column 
shows the percentage of data retrieved by the query. The time to present the 
first answer by using the data models is not reported in the tables, since it is 
several orders of magnitude smaller than any of the other values. (Computing 
estimates and errors for a million cells takes roughly 23 seconds of CPU time.) 

The results show the benefits of online aggregation. Evaluating the query in 
the first error bin (highest error) takes almost an order of magnitude less time 
than waiting for the query to be evaluated in the entire fact table. Hence, the 
user can have a quick, first answer within a fraction of the time that it takes 
to evaluate the entire query. As expected, the query evaluation times shown in 
Figure 4 are proportional to the error bin sizes (and the size of the entire fact 
table). It is important to point out that for many queries, the initial estimated 
error is going to be much less than that of an individual cell, since experiments 
show that aggregating cell values tends to diminish the error considerably. 

Figure 4: Results of online aggregation. 



4 Conclusions 

In this paper we have presented a technique that enables online processing of 
data cube queries. It does so by distributing the tuples in the fact table into 
a series of tables, or error bins, each corresponding to a level of error incurred 
when estimating the values of the cells in the core cuboid to which these tuples 
contribute. The estimations are computed by using models built over chunks of 
the data cube. These models can be drawn from a variety of techniques, such as 
linear regression, loglinear models, and wavelets. 

We have shown via experiments that the technique is capable of producing 
quick, yet approximate answers to the queries, reducing the latency normally 
experienced by queries posed to the fact table directly. The answer gets refined 
over time as the user is looking at it, and the process can be stopped at will. 

We are currently experimenting with the use of models other than the one 
reported in this paper (linear regression), as well as considering ways to bound 
error estimates for aggregations (not individual cells) more tightly, so the initial 
and subsequent error levels presented to the user are even smaller than the ones 
that can be reported by the current implementation. 

Abstract. In this paper, we propose a new technique for multidimensional query 
processing which can be widely applied in database systems. Our new technique, 
called tree striping, generalizes the well-known inverted lists and multidimension- 
al indexing approaches. A theoretical analysis of our generalized technique shows 
that both inverted lists and multidimensional indexing approaches are far from 
being optimal. A consequence of our analysis is that the use of a set of multidimen- 
sional indexes provides considerable improvements over one d-dimensional index 
(multidimensional indexing) or d one-dimensional indexes (inverted lists). The 
basic idea of tree striping is to use the optimal number k of lower-dimensional 
indexes determined by our theoretical analysis for efficient query processing. We 
confirm our theoretical results by an experimental evaluation on large amounts of 
real and synthetic data. The results show a speed-up of up to 310% over the 
multidimensional indexing approach and a speed-up factor of up to 123 (12,300%) over 
the inverted-lists approach. 

1. Introduction 

The problem of retrieving all objects satisfying a query which involves multiple at- 
tributes is a standard query processing problem prevalent in any database system. The 
problem especially occurs in the context of feature-based retrieval in multimedia data- 
bases [3], but also in relational query processing, e.g. in a data warehouse. The most 
widely used method to support multi-attribute retrieval is based on indexing the data 
which means organizing the objects of the database into pages on secondary storage. 
There is a variety of index structures which have been proposed for this purpose. One of 
the most popular techniques in commercial database systems is the inverted-lists 
approach. The basic idea of the inverted-lists approach is to use a one-dimensional index 
such as a B-tree [4] or one of its variants for each attribute. In order to answer a given 
range query with s attributes specified, it is necessary to access s one-dimensional index- 
es and to perform a costly merge of the partial results obtained from the one-dimensional 
indexes. For queries involving multiple attributes, however, the merging step is prohib- 
itively expensive and is the major drawback of the inverted-lists approach. Multidimen- 
sional index structures have been developed as an efficient alternative approach for 
multidimensional query processing. The basic idea of multidimensional index structures 
such as space-filling curves [14, 8], grid-file based methods [13, 5], and R-tree-based 
methods [7, 2], is to use one multi-attribute index which provides efficient access to an 
arbitrary combination of attributes. 

It is well-known that multidimensional index structures are very efficient for databases 
with a small number of attributes and outperform inverted lists if the query involves 
multiple attributes [11]. In many real-life database applications, however, we have to 
handle databases with a large number of attributes. For databases with a larger number 
of attributes, the performance of traditional multidimensional index structures rapidly 
deteriorates. Therefore, specific index structures for high-dimensional data have been 
proposed. Examples include the TV-tree [12], SS-tree [19], and X-tree [1]. For high 
dimensions (larger than 12), however, even the performance of specialized high-dimen- 
sional index structures decreases. 

In this paper, we propose a new approach, called tree striping, for efficient 
multi-attribute retrieval. The basic idea of tree striping is to divide the data space into disjoint 
subspaces of lower dimensionality such that the cross-product of the subspaces is the 
original data space. The subspaces are organized using an arbitrary multidimensional in- 
dex structure. Tree striping is a generalization of the inverted lists and multidimensional 
indexing approaches, which may both be seen as the extreme cases of tree striping. 

The rest of this paper is organized as follows: Section 2 introduces the basic idea of 
tree striping including the algorithm necessary for processing queries. In Section 3, we 
then provide a theoretical analysis of our technique and show that optimal query pro- 
cessing is obtained for tree striping. We also show that optimal tree striping outperforms 
the traditional inverted lists and multidimensional indexing methods. In Section 4, we 
then discuss the more elaborate query processing algorithms which make use of the 
specific advantages of “striped” trees and therefore further improve the performance. 
Section 5 provides the details of our experimental evaluation which includes compari- 
sons of tree striping to inverted lists and two multidimensional index structures, namely 
the R-tree and the X-tree. The results of our experimental analysis confirm the theoreti- 
cal results and show substantial speed-ups over the multidimensional indexing and the 
inverted-lists approaches. 

2. Tree Striping 

Our new idea presented in this paper is to use the benefits of both the inverted lists and 
high-dimensional indexing approaches in order to achieve optimal multidimensional query 
processing. Our approach, called tree striping, generalizes both previous approaches. 

2.1 Basic Idea 

The basic idea of tree-striping is to divide the data space into disjoint subspaces of lower 
dimensionality such that the cross-product of the subspaces is the original data space (see footnote 1). 
This means that each subspace contains a number of attributes (dimensions) and each 
object of the database occurs in all subspaces. For example, the three-dimensional data 
space (customer_no, discount, turnover) may be divided into the one-dimensional sub- 
space (customer_no) and the two-dimensional subspace (discount, turnover). Obvious- 
ly, the dimensionality of the subspaces is smaller than the dimensionality of the data 
space, and hence, we are able to index the subspaces more efficiently using any multidi- 
mensional index structure. 

To insert an object, we divide the object into subobjects according to the division of 
the data space. Then, we insert the subobjects in the multidimensional index structure 
managing the corresponding subspace. To process a query, we divide the query according 
to the division of the data space and issue the subqueries to the relevant multidimensional 
indexes. In a second step, we merge the results which have been produced by the 
indexes using an external sorting algorithm such as merge sort. The general idea and 
query processing strategy of tree striping is presented in Figure 1. 

Footnote 1: Note that a division of the data space into disjoint subspaces is different from a partitioning 
of the data space where the partitions have the same dimensionality as the original data 
space whereas subspaces have a lower dimensionality. 

Note that, in contrast to inverted lists, the selectivity of subspace indexes is in general 
relatively high because each index manages information about more than one attribute. 
Therefore, the amount of partial results produced in the first step is rather small, which 
means that the cost for the merging step is not significant. Our formal model, which 
will be presented in Section 3, confirms this fact. 

It is clear that the number and dimensionality of the data space divisions are important 
parameters for the performance of our technique. The optimal division mainly depends 
on the dimension, the number of data items, and the data distribution. The parameters 
have to be chosen adequately to achieve an optimal performance. For a uniform data 
distribution, the parameters for an optimal division into subspaces can be obtained 
easily from the theoretical analysis (cf. Section 3). 

2.2 Definition of Tree Striping 

In this section, we formally define the tree striping technique. In the following, we 
consider objects as vectors in a vector space and attributes as components of the vectors. 
Given is a data space of dimension $d$ and extension $[0..1]^d$, $N$ vectors $v$ having 
components $v_0, \ldots, v_{d-1}$, and an arbitrary multidimensional index structure MIS 
supporting the relevant query types. First, we need a mapping which assigns the 
dimensions to the different subtrees. 

Definition 1: (Dimension Assignment): 

The dimension assignment DA is a mapping 
$DA: \mathbb{R}^d \to (\mathbb{R}^{d_0}, \ldots, \mathbb{R}^{d_{k-1}})$ of a d-dimensional 
vector $v$ to a vector of $k$ $d_l$-dimensional vectors $w^l$, $0 \le l < k$, such that the 
following conditions hold: 

1. $\sum_{l=0}^{k-1} d_l = d$ 
2. $\forall j, 0 \le j < d, \exists l, 0 \le l < k, \exists i, 0 \le i < d_l: v_j = w^l_i$ 
3. $\forall l, 0 \le l < k, \forall i, 0 \le i < d_l, \exists j, 0 \le j < d: w^l_i = v_j$ 

Note that $w^l_i$ denotes the i-th component in the l-th index. To clarify the definition of 
dimension assignment, we provide a simple example: Given a 5-dimensional data space 
($d = 5$), we may define a dimension assignment $DA_{even}$ such that $k = 2$, $d_0 = 3$, and 
$d_1 = 2$, i.e. $DA_{even}$ divides the data space into two subspaces of dimensionality 3 and 2. 
Explicitly, $DA_{even}$ maps even dimensions to the first subspace and odd dimensions 
to the second subspace; more formally, $DA_{even} = (w^0, w^1)$, $w^0 = (v_0, v_2, v_4)$, 
$w^1 = (v_1, v_3)$. Thus a vector $v = (0, 4, 6, 5, 1)$ is mapped to 
$DA_{even}((0, 4, 6, 5, 1)) = ((0, 6, 1), (4, 5))$. Obviously, $DA_{even}$ meets the conditions 
specified in Definition 1 because 
all dimensions of the data space have been mapped to a subspace and vice versa. 
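
The even/odd example can be reproduced with a few lines of Python. This is a sketch only; the function name `dimension_assignment` and the representation of an assignment as a list of position groups are assumptions introduced here, not notation from the paper.

```python
def dimension_assignment(v, groups):
    """Split vector v into sub-vectors, one per index.

    groups[l] lists the positions j of v that are assigned to the l-th index.
    """
    d = len(v)
    assigned = [j for group in groups for j in group]
    # Conditions of Definition 1: the sub-dimensions sum to d, and every
    # dimension of the data space is mapped to exactly one sub-vector component.
    if sorted(assigned) != list(range(d)):
        raise ValueError("groups must cover every dimension exactly once")
    return tuple(tuple(v[j] for j in group) for group in groups)


# Even dimensions go to the first subspace, odd dimensions to the second.
EVEN_ODD = [(0, 2, 4), (1, 3)]
```

Calling `dimension_assignment((0, 4, 6, 5, 1), EVEN_ODD)` yields `((0, 6, 1), (4, 5))`, the mapping given in the text.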

Using the definition of dimension assignment, we are now able to formally define 
tree-striping: 

Definition 2: (Tree Striping): 

Given a database DB of N d-dimensional vectors and a dimension assignment DA. 
Then, a tree striping TS is defined as a vector of $k$ $d_l$-dimensional indexes 

$MIS^l = \{w^l\}$, $0 \le l < k$, with $w^l = DA^l(v)$, $v \in DB$. 



Tree striping as defined in Definition 2 is a generalization of the previous approaches. 
For the special case of k = d, tree striping corresponds to inverted lists because the 
dimension assignment produces d one-dimensional data objects; and for the special case 
of k = 1, tree striping corresponds to the traditional multidimensional indexing approach 
because we have one d-dimensional index. The most important question is whether 
there exists a tree striping which provides better results than the extremes (the well-known 
inverted lists and multidimensional indexing approaches). In particular, we have 
to determine whether there exists a k (1 < k < d) such that tree striping outperforms the 
other approaches. In the next Section, we introduce a theoretical model showing that an 
optimal k exists. Our experimental analysis presented in Section 5 confirms the results 
of our theoretical model and shows performance improvements of up to a factor of 120 
times over the inverted lists and up to 280% over the multidimensional indexing ap- 
proach. A second open question is how the attributes (dimensions) are assigned to the 
different trees such that the performance improvement is optimal. In Section 4, we dis- 
cuss the implications of different dimension assignments and also introduce optimized 
algorithms for query processing using striped trees. 

Note that tree striping as defined so far is independent of the multidimensional 
index structure used. Any multidimensional index structure such as the R-tree [7] 
and its variants (R+-tree [17], R*-tree [2], P-tree [9]), Buddy-tree [16], linear 
quadtrees [6], z-ordering [14] or other space-filling curves [8], and grid-file based 
methods [13, 5] may be used for this purpose. 

Before we describe our theoretical model, we first provide a simple algorithm 
for processing queries using striped trees. As the single indexes do not have 
all information about an object, but only about some attributes of the object, in 
general, we have to query all indexes in order to process a query. We therefore 
divide the query specification qs into sub-query specifications sqs[l] according to 
the dimension assignment. Then, we query each single index with the sub-query 
specification sqs[l] and record the results. In a final step, we have to merge the 
results by sorting the single results according to the primary key of the objects or 
any other object identifier. Figure 2 shows a first version of a query processing 
algorithm. An optimized version for querying striped trees is provided in Section 4. 

SetOfObject query(TreeStrip ts, QuerySpec qs) 
{ 
    int i; 
    SetOfSubObject sst[ts.num];   // partial results, one per index 
    SubQuerySpec sqs[ts.num];     // sub-queries, one per index 
    SetOfObject st;               // final result 

    // for all indexes 
    for (i = 0; i < ts.num; i++) 
    { 
        // query i-th index with sub-query 
        sqs[i] = ts.opt_dim_assign(i, qs); 
        sst[i] = ts.index[i].query(sqs[i]); 

        // sort result by primary key 
        sst[i].sort(); 
    } 

    // now merge single results 
    st = merge(sst, ts.num); 
    return st; 
} 

Fig. 2: A First Query Processing Algorithm 
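
For readers who want to experiment with the basic algorithm, the following Python sketch mirrors its structure: split the range query by the dimension assignment, query each index, and merge the sorted identifier lists. The linear-scan `range_query` merely stands in for a real multidimensional index, and all names are hypothetical. It assumes each object identifier occurs once per index.

```python
def range_query(index, lows, highs):
    """Linear-scan stand-in for a multidimensional index (illustrative only).

    index: list of (oid, sub_vector); returns the oids inside the box, sorted.
    """
    return sorted(
        oid for oid, w in index
        if all(lo <= x <= hi for x, lo, hi in zip(w, lows, highs))
    )


def merge_sorted(a, b):
    """Intersect two sorted oid lists in a single merge pass."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def striped_query(indexes, groups, lows, highs):
    """Query every striped index with its sub-query and merge the results."""
    result = None
    for index, group in zip(indexes, groups):
        sub_lows = [lows[j] for j in group]
        sub_highs = [highs[j] for j in group]
        partial = range_query(index, sub_lows, sub_highs)
        result = partial if result is None else merge_sorted(result, partial)
    return result if result is not None else []
```

An object qualifies only if every index reports its identifier, which is why the merge step is an intersection of the partial results.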

3. Theoretical Model 

As already mentioned, most of the multidimensional indexing approaches efficiently 
solve the multi-attribute retrieval problem on low-dimensional data. From our experience 
in real-life database projects, we have learned that even relational database systems 
have to handle relatively high numbers of attributes (more than 10), for which the 
performance of traditional index structures deteriorates. To process arbitrary queries (e.g., point, 
range, and partial match queries) efficiently on those databases, we have to index 
all the attributes equally, which means that we have to deal with a high-dimensional data space. 

Unfortunately, some mathematical problems arise in high-dimensional spaces which 
are usually summarized by the term ‘curse of dimensionality.’ A basic effect in high- 
dimensional space is the exponential growth of the volume: Let us assume a database of 
1,000,000 uniformly distributed objects consisting of 20 numerical attributes in the 
range [0...1]. Let us further assume that we are interested in a query which provides 10 
result objects located around the midpoint of the data space (0.5, 0.5, 0.5, ..., 0.5). Which 
range do we have to query in order to obtain 10 result objects? Obviously, we have to 
ensure that the volume of our query range equals $10 / 1{,}000{,}000 = 10^{-5}$, as the 
volume of the data space equals 1. This leads to a query range in each attribute of 
$\sqrt[20]{10^{-5}} \approx 0.56$. So we have to query the range (0.22-0.78, 0.22-0.78, ..., 0.22-0.78). 
That means a query with a selectivity of $10^{-5}$ leads to a query range of 0.56 in each 
attribute in a 20-dimensional data space. 
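
The arithmetic of this example is easy to check. The short Python sketch below computes the edge length of a cubic query under the stated uniformity assumption; the function name is ours.

```python
def query_edge_length(n_objects, n_results, dims):
    """Edge length of a cubic range query expected to return n_results objects,
    assuming uniformly distributed data in the unit hypercube."""
    selectivity = n_results / n_objects
    return selectivity ** (1.0 / dims)


edge = query_edge_length(1_000_000, 10, 20)
low, high = 0.5 - edge / 2, 0.5 + edge / 2
print(f"edge = {edge:.2f}, range = [{low:.2f}, {high:.2f}]")
```

For comparison, the same selectivity in a 2-dimensional space requires an edge length of only about 0.003, which illustrates how quickly the query range grows with the dimension.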

Considering these effects, we are able to provide a concise cost model of processing 
range queries in a high-dimensional data space using the tree striping technique. For the 
following, we assume a uniformly distributed set of N vectors in a d-dimensional space 
of extension $[0..1]^d$. Note that, although we assume a uniform distribution of the data, 
our model can be applied to real data as well (cf. Section 5). We will use the cost model 
to determine the optimal number of trees and accordingly the dimensions of the trees for 
a given data set, i.e. the optimal dimension assignment. 

Our cost model is divided into two parts: First, the cost arising from querying the 
striped trees, and second, the cost for merging the results of the striped trees into one 
final result. Both cost functions are highly influenced by the dimensions of the striped 
trees. The index lookup cost grows super-linearly with growing tree dimension. 
However, the merging cost grows super-linearly with the size of the result, which 
in turn falls with the dimension of the trees. This suggests that the total 
cost has a minimum where both costs are moderate. This minimum should be 
located somewhere between the d-dimensional index and the inverted-lists approaches. 

Several cost models for queries on multidimensional index structures have been published. 
The most suitable ones for our purposes are the model of Kamel and Faloutsos 
[10] and the similar model of Pagel et al. [15]. We decided to use the model of Kamel 
and Faloutsos as a basis for our considerations. We therefore assume that the 
multidimensional index structure aggregates a fixed number $C_{eff}$ of vectors into a data page 
such that the bounding box containing the vectors forms a square-shaped hyper-rectangle 
with the (hyper-)volume 

$V_{BB} = \frac{C_{eff}}{N}$. 

Thus, $C_{eff}$ denotes the actual fan-out of the index. From that, the edge length $a$ of a 
typical bounding box is $a = \sqrt[d]{C_{eff} / N}$. 

Next, we determine the probability that such a data page is loaded when processing a square 
range query with volume $V_q$. Analogously, we compute the edge length $q$ of $V_q$ as 
$q = \sqrt[d]{V_q}$. The probability can be determined using the so-called Minkowski sum. 
Intuitively, the Minkowski sum of two areas $a_1$ and $a_2$ can be constructed by painting 
$a_1$ using $a_2$ as a brush (cf. Figure 3). If both areas are multidimensional rectangles, 
we have to add the side-lengths of $a_1$ and $a_2$ accordingly. Thus, 
the Minkowski sum of query volume and the bounding box is 

$Mink(V_{BB}, V_q) = \left(\sqrt[d]{\frac{C_{eff}}{N}} + q\right)^d$. 

Fig. 3: The Minkowski Sum 

The Minkowski sum $Mink(V_{BB}, V_q)$ equals the probability that a randomly located 
bounding box and a randomly located query intersect. Thus, the expected number 
of data pages intersecting $V_q$ is $Mink(V_{BB}, V_q)$ multiplied by the number of pages: 

$P_{index}(d, N, q) = \frac{N}{C_{eff}(d)} \cdot \left(\sqrt[d]{\frac{C_{eff}(d)}{N}} + q\right)^d = \left(1 + q \cdot \sqrt[d]{\frac{N}{C_{eff}(d)}}\right)^d$ 






The number of data vectors in a data page also depends on the dimension d of the 
vectors. Assuming that each coordinate value is stored as a 32-bit floating point value 
and that there is an additional unique object identifier which also requires 32 bits, we 
determine $C_{eff}$ as: 

$C_{eff}(d) = \frac{\text{page-size} \cdot \text{storage-utilization}}{4 \cdot (d + 1)}$ 

The cost for combining the results of the multidimensional index accesses mostly depends 
on the selectivities of the indexes. If $|FRS|$ is the size of the final result set of query Q, then 
$|IRS_i|$ is the size of the intermediate result set produced by the i-th index having dimension $d_i$. Thus: 

$|FRS| = N \cdot q^d$, $|IRS_i| = N \cdot q^{d_i}$. 

Note that we have to sort each intermediate result set according to the object identifiers 
in order to be able to merge them into the final result set. We have to apply an external 
sorting algorithm since, for larger q or smaller $d_i$, the result set will exceed the available 
main memory. According to Ullman [18], the cost for performing multiway merge-sort 
on a relation of B blocks is $2B \cdot \log_M(B)$ where M is the number of cache pages available 
to the sorting process. We can store the object identifiers in a densely packed fashion 
such that $|IRS_i|$ object identifiers require $4 \cdot |IRS_i| / \text{page-size}$ pages. From that, the 
cost for sorting the result set of a single index is: 

$P_{sort}(d_i, N, q) = 2 \cdot \frac{4 \cdot N \cdot q^{d_i}}{\text{page-size}} \cdot \log_M\left(\frac{4 \cdot N \cdot q^{d_i}}{\text{page-size}}\right)$ 






250 S. Berchtold et al. 



To determine the total cost, $P_{sort}$ and $P_{index}$ have to be summed up for all striped trees. 
For merging the result sets, each of them has to be scanned once more. The total cost is: 

$P(d, k, N, q) = \sum_{i=0}^{k-1} \left[ \left(1 + q \cdot \sqrt[d_i]{\frac{N}{C_{eff}(d_i)}}\right)^{d_i} + \frac{4 \cdot N \cdot q^{d_i}}{\text{page-size}} \cdot \left(1 + 2 \cdot \log_M\left(\frac{4 \cdot N \cdot q^{d_i}}{\text{page-size}}\right)\right) \right]$ 

Figure 4 shows the total cost over k in a typical setting with a database of 1,000,000 
uniformly distributed objects in a 15-dimensional data space. The selectivity of the query 
is 0.01%. There is a clear minimum between k=2 and k=3. 
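
The shape of the cost curve can be explored with a small Python sketch of the low-dimensional cost model, i.e. index lookup cost plus sort and scan cost, summed over k trees of equal dimensionality d/k. The page size (4096 bytes), storage utilization (0.7) and number of cache pages (100) are assumed values, not the paper's settings, and clamping the number of sort passes at zero for results smaller than one block is our own addition.

```python
import math


def c_eff(dim, page_size=4096, utilization=0.7):
    """Effective page capacity: 4 bytes per coordinate plus a 4-byte object id."""
    return page_size * utilization / (4 * (dim + 1))


def index_cost(dim, n, q):
    """Expected number of data pages read by one dim-dimensional index."""
    return (1 + q * (n / c_eff(dim)) ** (1 / dim)) ** dim


def merge_cost(dim, n, q, page_size=4096, cache_pages=100):
    """Pages touched for sorting (2B log_M B) and scanning (B) one result set."""
    blocks = 4 * n * q ** dim / page_size
    # A result that fits into a single block needs no external sort passes.
    passes = math.log(blocks, cache_pages) if blocks > 1 else 0.0
    return blocks * (1 + 2 * passes)


def total_cost(d, k, n, q):
    """Total cost for k striped trees of equal dimensionality d / k."""
    dim = d / k
    return k * (index_cost(dim, n, q) + merge_cost(dim, n, q))


d, n = 15, 1_000_000
q = 1e-4 ** (1 / d)  # query edge length for a selectivity of 0.01%
for k in range(1, d + 1):
    print(k, round(total_cost(d, k, n, q)))
```

The absolute numbers depend on the assumed parameters; the point of the sketch is only that both extremes, k = 1 and k = d, are more expensive than some intermediate k.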

Thus, we are able to determine an optimal k by solving the following equation: 

$\frac{\partial P(d, k, N, q)}{\partial k} = 0$   (eq. 1) 

Fig. 4: Total Cost for Processing a Range Query 

Footnote: For $d_0, \ldots, d_{k-1}$, only whole numbers are meaningful. This effect is handled later, but is of 
minor importance for our cost model. 

The analytic evaluation of this equation yields a rather large formula which is omitted due to 
space limitations. A MAPLE-generated C function determining the derivative can be used to 
calculate the optimum. 

Unfortunately, the cost model presented so far is accurate only 
in the low-dimensional case. This is caused by the fact that in high-dimensional data 
spaces the data pages cannot be split in each dimension. If we split a 20-dimensional 
data space once per dimension, we obtain $2^{20}$ (about 1,000,000) data pages. Obviously, the 
number of data objects would have to grow exponentially with the dimension in order to 
allow one split per dimension. Therefore, we provide a special high-dimensional adaptation 
of our cost model. Our extension assumes that data pages are split only in the first 
d' dimensions where d' is the logarithm of the number of data pages to the basis of two: 

$d' = \log_2\left(\frac{N}{C_{eff}(d_i)}\right)$ 

Fig. 5: Optimal Dimension Assignment 

The data pages have the average extension 1/2 in d' dimensions and extension 1 in all 
remaining dimensions (d - d'). When determining the Minkowski sum, we additionally 
have to consider that only a part of the volume is located inside the data space because 
in the dimensions which have not been split, the extension of the Minkowski sum is still 
1, rather than (1 + q): 

Thus, the expected number of data pages accessed in the high-dimensional case is: 

Adding the sort cost we obtain the following total cost for high-dimensional data spaces: 

4. Query processing 

For optimal response times, we have to make two decisions: We first have to choose an 
adequate dimension assignment and second, we have to choose the right strategy for 
processing queries. 

As a result of the theoretical analysis presented in Section 3 there exists an optimal 
number k of striped trees, which can be determined according to our cost model 
(cf. equation 1). Since k is a real number, however, we cannot directly use k as a parameter 
for our query processor. Instead, we use the floor of k 

$k_{opt} = \lfloor k \rfloor$ 

and then determine the optimal dimensionality of our trees given by 

$d_{opt} = \lfloor d / k_{opt} \rfloor$. 

Since in general, $k_{opt} \cdot d_{opt}$ is smaller than d, we have to distribute the remaining 

$d_{rem} = d - k_{opt} \cdot d_{opt}$ 

attributes to our trees. Thus, we obtain $d_{rem}$ trees with dimensionality $(d_{opt} + 1)$ and 
$(k_{opt} - d_{rem})$ trees with dimensionality $d_{opt}$. In the following, we have to distinguish 
between two cases: The first case is that we have additional information about the 
selectivity of the attributes, which usually occurs for relational databases. The second case is 
that we have no additional information, which usually occurs in indexing multimedia data 
using feature vectors. Let us first consider the more general case that we do not have any 
additional information and therefore assume that all attributes have the same selectivity. 
In this case, the optimal dimensionality $d_{opt}$ of our trees may be used to define the 
following Optimal Dimension Assignment. 

void insert(TreeStrip ts, object t) 
{ 
    int l; 
    SubObject st[ts.num]; 

    // for all indexes 
    for (l = 0; l < ts.num; l++) 
    { 
        // determine sub-objects 
        st[l] = ts.opt_dim_assign(l, t); 

        // insert sub-objects into l-th index 
        ts.index[l].insert(st[l]); 
    } 
} 

Fig. 6: Insertion Algorithm 



Definition 3: (Optimal Dimension Assignment): 

The dimension assignment $DA_{opt}$ is a dimension assignment according to Definition 1 
such that: 

Intuitively, the optimal dimension assignment assigns the i-th component of the original 
vector v to a component of one of the vectors $w^l$ such that the first vector receives 
the first $d_0$ components $(v_0, \ldots, v_{d_0 - 1})$, the second vector accommodates the 
components $(v_{d_0}, \ldots, v_{d_0 + d_1 - 1})$, and so on. 

Using the optimal dimension assignment according to Definition 3, we now are able 
to present the insert algorithm of our tree striping technique, as depicted in Figure 6. In 
order to insert an object t, we simply divide t into a set of $k_{opt}$ sub-objects st[l] (using the 
optimal dimension assignment) and insert them into the corresponding striped tree 
ts.index[l] ($0 \le l < k_{opt}$). 
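
A minimal Python sketch of these steps follows: deriving the tree dimensionalities from a real-valued k, cutting an object into consecutive sub-objects, and inserting them. It rests on assumptions of ours: the extra dimensions are given to the first trees, k is clamped to the range 1..d, and each index is a plain list standing in for a multidimensional index structure. All function names are hypothetical.

```python
import math


def tree_dimensions(d, k):
    """Dimensionalities of the striped trees for a real-valued optimum k.

    Returns k_opt dimensionalities that sum to d: d_rem trees receive
    d_opt + 1 dimensions, the remaining trees receive d_opt dimensions.
    Giving the extra dimension to the first trees is an assumption.
    """
    k_opt = max(1, min(d, math.floor(k)))
    d_opt = d // k_opt
    d_rem = d - k_opt * d_opt
    return [d_opt + 1] * d_rem + [d_opt] * (k_opt - d_rem)


def split_object(v, dims):
    """Cut vector v into consecutive sub-objects of the given dimensionalities."""
    parts, start = [], 0
    for dim in dims:
        parts.append(tuple(v[start:start + dim]))
        start += dim
    return parts


def insert(indexes, dims, oid, v):
    """Insert one object: each sub-object goes into its own index.

    Each index is a plain list here, standing in for an R-tree or X-tree.
    """
    for index, sub in zip(indexes, split_object(v, dims)):
        index.append((oid, sub))
```

For d = 15 and k = 2.5, `tree_dimensions` returns `[8, 7]`: one tree with $d_{opt} + 1 = 8$ dimensions and one with $d_{opt} = 7$.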

A more complex algorithm is required for processing queries on striped trees. A rather 
simple query processing algorithm has already been presented in Section 2. The algo- 
rithm depicted in Figure 2, however, has a major drawback: Let us assume that we have 
to process a partial range query PRQ which specifies attributes a, b and c: 

$PRQ = \{*, *, [a_l, a_u], *, *, [b_l, b_u], *, *, [c_l, c_u], *, *\}$.

Let us further assume that all three attributes are located in the first of the striped
trees. Obviously, it does not make sense to query any tree other than the first tree because
none of the other trees provides any selectivity. The algorithm presented in Figure 2, however,
executes queries on all trees ignoring the expected selectivity of the trees. In order to 
process queries efficiently we have to take the selectivity of a tree into account and query 
a tree only if the expected gain in selectivity is worth the cost of querying the tree. 

Another potential improvement of the query processing algorithm can be exemplified
by the following situation: Assume that the three specified attributes a, b and c in the
above example are spread over two striped trees, $T_0$ managing attributes a and b, and $T_1$
managing attribute c. After querying tree $T_0$, we will typically receive a set of answers
(candidates) which may contain some false hits. This assumption holds because the
selectivity of $T_0$ is much higher than the selectivity of $T_1$. If we furthermore assume
meaningful queries, i.e. queries having a good selectivity on all attributes, the set of
candidates will in general be small. In this case, the cost for loading the candidate
objects from secondary storage and checking whether the objects fulfill the query
specification may be lower than the cost of querying additional trees.

Let us now consider the second case where we do have some additional information about
the selectivity of the attributes. A different selectivity of the attributes may be induced
by the different data types of the attributes (e.g., a boolean attribute usually has a
selectivity of 50%) and by different data distributions. We can use this information to
adapt the optimal dimension assignment. If we are able to query the tree containing the
attributes with the highest selectivity first, the resulting set of candidates will be rather
small and will contain only a few false hits. Therefore, query processing can be finished
without querying the other trees. This means that, if we have information about the
selectivity of attributes, we should sort the attributes according to their selectivity
before applying the dimension assignment.¹ Note that in some cases, a non-uniform division
may lead to better results. For example, let us assume that we have objects with 9
attributes (a, b, ..., i), that $k_{opt}$ equals 3, and that the attributes a to d have a high
selectivity whereas the selectivity of attributes e to i is rather low. Then, it is
beneficial to divide the objects into sub-objects (a, b, c, d), (e, f), and (g, h, i), which
would be a sub-optimal division assuming no a-priori knowledge about the selectivity of
attributes.
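The sorting step mentioned above might look like the following sketch. It assumes that the selectivity of each attribute is available as an estimated fraction of matching objects (a smaller fraction meaning a more selective attribute); the function name order_by_selectivity and this representation are illustrative assumptions, not part of the original technique description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// fraction[i]: assumed fraction of objects matched by a typical condition on
// attribute i (smaller means more selective).
// Returns the attribute indices ordered from most to least selective.
std::vector<std::size_t> order_by_selectivity(const std::vector<double>& fraction) {
    std::vector<std::size_t> order(fraction.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    // stable_sort keeps the original order of equally selective attributes
    std::stable_sort(order.begin(), order.end(),
                     [&fraction](std::size_t a, std::size_t b) {
                         return fraction[a] < fraction[b];
                     });
    return order;
}
```

The resulting order can then be cut into stripes, uniformly or non-uniformly, so that the most selective attributes end up in the first tree.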

Considering all these effects, we are now able to provide a more sophisticated algo- 
rithm for the processing of queries on striped trees. The algorithm (cf. Figure 7) first 
determines whether a linear search of the database is expected to be cheaper than a
search using trees, which may be the case for very large queries. The algorithm then sorts
the striped trees according to their selectivity, i.e. the tree which probably provides the 



SetOfObject query(TreeStrip ts, QuerySpec qs)
{
    int i, cost_index, cost_linear;
    SetOfSubObject sst[ts.num];
    SubQuerySpec sqs[ts.num];
    SetOfObject st;                      // set of candidates

    // sort indexes according to selectivity
    ts.sort_index(qs);

    // determine sub-queries
    for (i = 0; i < ts.num; i++)
        sqs[i] = ts.opt_dim_assign(i, qs);
    i = 0;

    // estimate cost
    cost_index = cost_model(sqs[0]);
    cost_linear = cost_linear_scan(sqs[0]);
    while (i < ts.num && cost_index < cost_linear)
    {   // query index
        sst[i] = ts.index[i].query(sqs[i]);

        // sorted merge of result
        sst[i].sort();
        merge(st, sst, ts.num);

        // advance to the next tree and re-estimate cost
        i++;
        if (i < ts.num)
        {   cost_index = cost_model(st, sqs[i]);
            cost_linear = cost_linear_scan(st);
        }
    }
    if (i < ts.num)
    {   // load attributes
        database.load(st);
        remove_false_hits(st, qs);
    }
    return st;
}

Fig. 7: Query Processing using Tree Striping 



1. Note that this operation involves not only the query processing but also the dimension
assignment, since we have to ensure that the attributes with the best selectivity are assigned
to the first trees.

smallest set of candidates is queried first. If the querying of the first tree leads to a small
set of candidates, we determine whether loading these candidates from secondary stor- 
age is cheaper than querying the second tree. If this is the case, we load the attributes and 
output all candidates fulfilling the query specification. Otherwise, we query the second 
tree. This process iterates until all trees have been queried or the candidates are loaded 
and processed. 
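The stopping rule of this iteration can be illustrated with a small sketch. It assumes that the cost estimates are already available as plain numbers; the function trees_to_query and its two input vectors are hypothetical stand-ins for the cost model of Section 3, which is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// tree_cost[i]: estimated cost of querying the i-th tree (trees sorted by selectivity).
// load_cost[i]: estimated cost of loading and checking the candidates available after
//               i trees have been queried; load_cost[0] is the cost of a linear scan.
// Returns how many trees are queried before the remaining candidates are loaded.
std::size_t trees_to_query(const std::vector<double>& tree_cost,
                           const std::vector<double>& load_cost) {
    assert(load_cost.size() >= tree_cost.size());
    std::size_t i = 0;
    // keep querying while the next tree is cheaper than loading the candidates
    while (i < tree_cost.size() && tree_cost[i] < load_cost[i]) {
        ++i;
    }
    return i;
}
```

A return value of 0 corresponds to the case in which a linear search is expected to be cheaper than any tree query, and a return value equal to the number of trees means that all trees are queried.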

As the implementation of multidimensional index structures is complex, the assignment
of different data types such as strings and floating-point numbers to one tree is not
practicable. The division of an object may therefore be induced not only by the expected
performance improvement but also by other considerations. Obviously, this can lead to 
sub-optimal dimension assignments. Our practical experience, however, shows that a 
slightly sub-optimal dimension assignment performs nearly as well as the optimal di- 
mension assignment. 

5. Experimental Analysis 

To show the practical relevance of our method, we performed an extensive experimental 
evaluation of tree striping and compared it to the inverted lists and the multidimensional 
indexing approach. All experimental results have been computed on an HP9000/780 
workstation with several GBytes of secondary storage. For the experiments, we used an 
object-oriented implementation (C++) of the R*-tree [2] and the X-tree [1].

The test data used for the experiments are real data consisting of text data describing 
substrings of a large database of texts, and synthetic data consisting of uniformly distrib- 
uted points in high-dimensional space. The block size used for our experiments is 
4 KByte, and all query processing techniques were allowed to use the same amount of 
cache. For a realistic evaluation, we used very large amounts of data (up to 80 MBytes) 
in our experiments. The total amount of disk space occupied by the created indexes 
(inverted lists, multidimensional indexes and tree-striped indexes) is about 2 GBytes 
and the CPU time for inserting the data adds up to about one week. 

In a first experiment, we confirmed our theoretical result (cf. Section 3) that the tree
striping technique, as a generalization of the inverted-lists and multidimensional indexing
approaches, outperforms both other techniques. For the experiment, we used 1,000,000
uniformly distributed data objects of varying dimensionality (d = 2..16). We built the
corresponding indexes (R*-tree) and queried the indexes with a selectivity of $10^{-5}$,
which corresponds to an expected result of about 10 hits. In order to avoid statistical
effects, we used the average cost of 100 uniformly distributed query windows. The observed
variance was rather small. We compared different tree stripings (varying the value of k)
and determined the optimal dimension assignment (optimal value of k). The tested dimension
assignments for the 16-dimensional data set are (16), (8, 8), (6, 5, 5), (4, 4, 4, 4),
(2, 2, 2, 2, 2, 2, 2, 2), and (1, 1, ..., 1, 1). The data sets of other dimensionality have
been tested analogously. In

Fig. 8: Comparison of Measured Optimal Dimension Assignment and Model Estimation

a. Speed-Up over Inverted Lists    b. Speed-Up over Multidimensional Indexing



Fig. 9: Speed-Up of Tree Striping for an Increasing Number of Data Items 



Figure 8, we show the optimal dimensionality ($d_{opt}$) of striped trees depending on the
dimensionality of the data. For d=2 and d=4, the optimal dimension assignment of tree
striping provides one d-dimensional index, i.e. it is identical to multidimensional indexing.
As expected according to our theoretical analysis, for higher dimensions the optimal
dimension assignment of tree striping is between the extreme cases: For d=12, we obtain
two 6-dimensional indexes and for d=16, we obtain a division into 3 indexes with
dimensionality (6, 5, 5). Note that in all experiments, the optimal dimension assignment
estimated by our cost model exactly matches the measured optimum. For our experiments,
we therefore use the optimal dimension assignment as determined by our cost model.

Another important criterion for the evaluation of indexing techniques is their scalability,
that is, the behavior of the technique for an increasing size of the database. We therefore
performed an experiment using a fixed dimensionality (d=16) and a fixed query selectivity
of 10'^ and varied the number of data items from 10,000 to 1,000,000. Again, we used
our cost model to determine the optimal dimension assignment. The speed-up over
multidimensional indexing starts with a moderate value of 107% for a small database but,
as the size of the database increases, the speed-up also increases up to 230% for the
largest database of 1,000,000 objects (cf. Figure 9). The speed-up over the inverted-lists
approach starts with 228% and reaches its maximum of 2,000% (20 times faster) for the
largest database of 1,000,000 objects (cf. Figure 9).

The intention of the experiment depicted in Figure 10 is to show that the high speed-ups
are independent of the selectivity of the queries. We repeated the previous experiments
for different dimensionalities (shown are the experiments for d=12 and d=16) using
selectivities between 10'^ and 10'^. Again, we obtained a speed-up of 210% to 220% over
the multidimensional index and a speed-up factor of 4 to 20 over the inverted lists.

To show the practical relevance of our 
technique, in a last series of experi- 
ments, we evaluated the tree striping 
technique using real data which consists of text data describing substrings of a large 
database of texts. 




Fig. 10: Performance for Varying Selectivities (legend: Inverted List, Mult. Index)

Fig. 11: Optimal Dimension Assignment for Real Data (Text Data)

Fig. 12: Performance of Partial Range Queries (Text Data)



In Figure 11, we compare the measured performance for range queries with a selectivity
of 0.2% to the performance determined by our model (cf. Section 3). The minima of the
two curves correspond to the optimal dimension assignment. Note that the model
estimates the optimal dimension assignment correctly, although it assumes a
uniform distribution of the data. The difference between model and measurements for
large dimensions (i.e. small k), however, may be explained by the non-uniform distribution
of the real data. In Figure 12, we present the speed-up of tree striping over inverted
lists and multidimensional indexing for partial range queries with a varying number of
attributes specified (s=4..8). It is interesting that for a partial range query with 4
attributes specified, tree striping degenerates to inverted lists. If more than 4 attributes
are specified, tree striping becomes better than both inverted lists and multidimensional
indexing. Note that for s=6, inverted lists are better than multidimensional indexing,
whereas for s=8, multidimensional indexing is better than inverted lists.

6. Conclusions 

In this paper, we propose a new technique for multidimensional query processing, called 
tree striping. Tree striping is a generalization of the inverted-lists technique and the 
multidimensional indexing approach. A theoretical analysis of our technique shows that 
tree striping clearly outperforms both the inverted-lists and the multidimensional indexing
approaches. An experimental evaluation of our technique confirms the results of our 
theoretical analysis, unveiling significant speed-up factors of tree striping over inverted 
lists and multidimensional indexing for different databases of varying size and dimen- 
sionality, as well as for different query types. 

Our future work will include an application of tree striping to other multidimensional
index structures not handled in this paper, and we expect a substantial performance
improvement over the non-striped versions. We further plan to develop a parallel version of
the tree striping technique. Since the separate indexes can be queried independently, we
expect a nearly linear speed-up of the query processing time for the parallel version.


Abstract. The application of data mining algorithms needs a goal-oriented
preprocessing of the data. In practical applications the preprocessing task is very
time consuming and has an important influence on the quality of the generated
models. In this paper we describe a new approach for data preprocessing. Com- 
bining database technology with classical data mining systems using an OLAP 
engine as interface we outline an architecture for OLAP-based preprocessing that 
enables interactive and iterative processing of data. This high level of interaction 
between human and database system enables efficient understanding and prepar- 
ing of data for building scalable data mining applications. Our case study taken 
from the data-intensive telecommunication domain applies the proposed method- 
ology for deriving user communication profiles. These user profiles are given as 
input to data mining algorithms for clustering customers with similar behavior. 



1 Introduction 

The telecommunication industry currently faces growing competition, especially in Germany.
“Knowing the customers” is one of the most important steps towards
customer-specific pricing and offering special tariffs. Data Mining is a very promising 
technology to tackle the Customer Relationship Management (CRM) task that becomes 
more and more important in all industries. 

The telecommunications domain is very data-intensive. The immense amount of 
data makes classical desktop data mining systems hardly applicable. Especially the 
tasks of data integration and preprocessing are hard to handle. The preprocessing task 
that is very time consuming in real data mining projects gets more and more difficult. 

Recent work emphasizes the promising direction of integrating database technology
and data mining (see e.g. [7], [9]). We propose merging scalable database technology
using a data mart with classical data mining systems and algorithms. In our
approach the interface for this combination consists of an OLAP (online analytical
processing) engine that allows a high level of interaction between human and database
system. As far as possible, most preprocessing operations are performed in the database
system. Additionally, the consistent way of performing preprocessing in a database is
scalable and can easily be revoked and reused.

The paper is organized as follows. In section 2 we outline our analysis scenario. The
analysis scenario describes our real-world application of handling the complex,
high-dimensional and enormous amount of communication data. In section 3 we explain
the general core of our approach, called OLAP-based preprocessing. In section 4 we present
a case study from the telecommunication domain using OLAP-based preprocessing: we outline
how communication profiles are generated using call detail records and show how these
profiles are used for customer segmentation. Section 5 gives a short overview of related
work and section 6 concludes with an outlook on further work.

2 A Telecommunication Analysis Scenario 

Telecommunication companies make their business by selling communication minutes 
to their customers. Considering these minutes as the main product different properties 
must be distinguished that determine the variation of the product. These properties or 
dimensions include, for example, the daytime, weekday or weekend, the calling dis- 
tance etc. Calculating the different possible combinations of attribute values for just 
these three dimensions gives a rather large number. For example, taking every hour of a
day as a value for the dimension daytime, distinguishing just between weekday and weekend
and considering 10 different calling distances (e.g. city, international, ...) gives
24 × 2 × 10 = 480 different combinations, called communication features. Internally,
Deutsche Telekom distinguishes many more than just 10 calling distances.

From a customer care point of view, Deutsche Telekom aims at identifying
groups of customers which typically buy the same or similar products. By understanding
the needs of the customers, special offers can be made and the tariffs can be adjusted
to the customer behaviour. To realize this difficult task using data mining, we build in a
first step a communication profile for each customer. A communication profile should
represent the typical calling behavior of a customer over a rather long period of time.
When we analyzed the domain and the data given in the form of call detail records, two
difficulties arose. First, a customer buys only one product at a time, which is contrary to
the classical market basket analysis. Second, the large number of single transactions (call
detail records) must be aggregated to a more meaningful level.

In our scenario call detail records of around 4500 private customers have been stored
in a panel for 5 years. On average a private customer makes between 3 and 7 phone calls
a day. Thus, approximately 12,000,000 call detail records were produced and stored
in flat files. In addition to this communication data, social-demographic data for these
private customers was collected.

After having integrated the communication as well as the social-demographic data
in one data mart, we started to load the data from the data mart into a data mining
system. At first, we intended to perform all necessary preprocessing within the data mining
system. This approach, however, had the serious drawback that all time-consuming
preprocessing steps had to be executed again and again for new data. Additionally, this
way of preprocessing within the data mining tool was not as scalable, reusable and flexible
as necessary for our complex and data-intensive domain.

3 Our architecture for OLAP-based preprocessing 

Practical experience in the development of data mining applications in general has
shown that the sub-phases data connection and integration, data understanding and
preprocessing take up the majority of the whole process time. Whereas preprocessing plays
a minor role in scientific research, it is of highest relevance in real-world applications.
The pure application of data mining algorithms in the development process requires
only a small amount of time. However, the quality of the generated models depends
heavily on the algorithm-specific data preprocessing.




Figure 1. Procedure of preprocessing data, extended in the lower part with OLAP functionality

In the upper part of figure 1 the typical procedure of preprocessing the data is shown. 
Preprocessing is performed in interaction with understanding the data. To understand
the data, different methods can be applied: One way is to apply multivariate statistical
methods to calculate some data characteristics. Another way is the exploratory
visualization of the data. After having gained some ideas and hints about the data,
goal-oriented preprocessing operations can be performed. We distinguish between three
groups of preprocessing operations: Data reduction, data derivation and data transfor- 
mation. The process of data reduction is partitioned into horizontal and vertical reduc- 
tion. Likewise, data derivation can also be performed in a horizontal and vertical man- 
ner. Horizontal derivation adds new individuals to the data set (also called balancing) 
and vertical derivation means the generation of new attributes by combining existing 
ones. Data transformation includes operations like applying mathematical functions, 
discretization or categorization, normalization and replacement of values. 

Handling very large amounts of data, the iterative and interactive task of data under- 
standing and preprocessing becomes more and more time consuming. In the following 
we will show how the preprocessing task described earlier can be performed more ef- 
ficiently on large amounts of data. The lower part of figure 1 shows the idea of our 
OLAP-preprocessing framework. We propose using an OLAP-engine to have a flexible 
way for understanding and preprocessing large datasets. Generating reports with tables 
and graphs by applying OLAP operations (see lower left part of figure 1) is usually used
to obtain valuable information for business decisions. In our case we do not only want 
to apply OLAP for data understanding or visualization, but also to create a target data 
set that can be used to generate new hypotheses by applying a data mining algorithm. 

The idea described above was realized by the architecture depicted in figure 2. We
consider the integration of the legacy data in a data warehouse or data mart a necessary
prerequisite for efficient data analysis. Therefore, the data integration process is
performed only once in our framework and all subsequent steps are performed on the
data warehouse or data mart. On the one hand, a hyper-cube can be modeled on
top of the data mart; on the other hand, a knowledge discovery process can be started
based on the data in the data mart. An OLAP-engine is used on the modeled hyper-cube.
OLAP is typically performed for the user-driven validation of hypotheses. OLAP
functionalities include drill-down, roll-up, slice and dice, and pivoting operations for
flexible handling and transforming of the data (see [1]). The knowledge discovery process
consists of various phases that have to be passed (cf. CRISP-DM standard process model
for knowledge discovery [3]). The phases typically comprise the tasks of data selection,
preprocessing, model generation and interpretation, as shown in figure 2. Data
mining algorithms are executed on the target data set to generate models which have to
be interpreted with respect to the business-relevant questions.




Figure 2. Architecture for OLAP-based preprocessing

A target data set is derived by applying one of the preprocessing operations described
earlier in this section. As indicated in figure 2 by the arrow from the hyper-cube to the
data preparation part of the knowledge discovery process, the data created using the
OLAP functionalities on top of the hyper-cube can also serve as preprocessed target
data for data mining algorithms. The OLAP-preprocessed data is loaded into the data
mining system, where further preprocessing can be performed if necessary.

4 A case study 

As mentioned in section 2, Deutsche Telekom uses a panel to analyze the communication
behavior of its customers. For 5 years every phone call of around 4500 private
customers has been logged with their agreement. The information stored in anonymous
form includes the number of calls per day, the duration of each call, the calling
distance, the daytime, weekday etc. Additionally, a well-known market research institute
was commissioned to survey these private customers to obtain social-demographic
information in order to better classify and describe different customer groups. In the
present time of a very competitive and unsteady telecommunication market in Germany,
Deutsche Telekom tries to extract valuable information from these data sources for
goal-oriented marketing actions and new pricing activities. To meet the requirements
for a quick and efficient analysis of the different information sources, it was decided
to develop a data mart called PAS (for panel analysis system) that contains all
relevant information. Data mining methods should be applied to the PAS such that the
data mart serves as a decision support system for marketing and pricing actions.

In a case study using the PAS data mart, the goal was to describe each panel
customer by a communication profile and build groups of customers that have similar
profiles. A communication profile consists of typical characteristics (e.g. a similar
number of international calls at certain times on a weekday) of the customer. The
desired communication profiles were generated using the idea of OLAP-preprocessing that
was introduced in section 3.

4.1 Application of OLAP-Preprocessing 

Our approach presented in section 3 was applied to the case study described above.
To derive the target data we performed most preprocessing steps by
applying various OLAP functions. Because of the complexity indicated in section 2 we
restricted the communication data to 3 months for building a customer profile.

In a first approach, the calling minutes of each customer were aggregated over the
hour in which the call started, every day of the week and the calling distance. However,
this approach was abandoned since the number of communication features (over 1000!)
was too large to handle. An exploratory analysis revealed that the data was too
detailed. In a second approach, the calling minutes were summed up over 6-hour slots
starting from 0am to 6am, 6am to 12pm and so on. Furthermore, we distinguished
just between weekdays (Monday through Friday) and weekends (Saturday and Sunday).
The values for the dimension calling distance were also reduced to three. As a result we
obtained 24 communication features (4 × 2 × 3 = 24) used to describe the
communication behavior of a single customer. On the left side of figure 3 an average of
all panel customers with the chosen 24 communication features is depicted. The middle
of the x-axis separates weekdays from weekends; the first 12 communication
features represent the weekdays section. Weekdays are further separated into three areas
depending on the calling distance (city, regional, national). Furthermore, every calling
distance is subdivided into the 4 time windows mentioned earlier. The y-axis represents
the average communication minutes for each communication feature.
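The layout of the 24 communication features along the x-axis can be illustrated by a small index computation. This is a sketch only: the function feature_index, the Distance enumeration and the numbering from 0 to 23 are assumptions made for illustration; the ordering (weekday block first, then calling distance, then 6-hour slot) follows the description above.

```cpp
#include <cassert>

// Assumed encoding of the three calling distances used in the case study.
enum class Distance { City = 0, Regional = 1, National = 2 };

// Maps a call to one of the 24 communication features (hypothetical numbering 0..23):
// features 0..11 are weekdays, 12..23 weekends; within each block the features are
// grouped by calling distance, and each distance has four 6-hour time slots.
int feature_index(int start_hour, bool weekend, Distance dist) {
    assert(start_hour >= 0 && start_hour < 24);
    const int slot = start_hour / 6;  // 0: 0-6h, 1: 6-12h, 2: 12-18h, 3: 18-24h
    return (weekend ? 12 : 0) + static_cast<int>(dist) * 4 + slot;
}
```

For example, a regional call starting at 7 o'clock on a weekday falls into the second time slot of the regional group of the weekday block.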

The desired aggregated information was returned by an OLAP tool in the form of a
single table that contained all communication transactions of the panel customers within
3 months. In a next OLAP preprocessing step the “pivot” functionality transformed the
column representation of a single customer to the more appropriate form of a row
representation. A single row consists of a customer id and the calling minutes aggregated
for each of the 24 communication features. Using simple visualization techniques to
view the distribution of values for each communication feature, it became evident that
all communication features have a more or less left-slanted distribution. As a prerequisite
for applying the cluster algorithm, the data had to be transformed from a left-slanted
distribution to a symmetric distribution. This was achieved by transforming the calling
minutes of each communication feature with the logarithmic function.
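The pivot and the logarithmic transformation can be sketched outside an OLAP tool as follows. The record type Aggregate, the function pivot_and_log and the use of log(1 + x) instead of a plain logarithm (so that features with zero minutes stay defined) are assumptions for illustration; the case study itself only states that a logarithmic function was applied.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// One row of the column representation: minutes of one customer for one feature.
struct Aggregate { int customer_id; int feature; double minutes; };
using Profile = std::array<double, 24>;

// Pivots the rows into one 24-feature profile per customer and log-transforms it.
std::map<int, Profile> pivot_and_log(const std::vector<Aggregate>& rows) {
    std::map<int, Profile> profiles;
    for (const Aggregate& r : rows) {
        assert(r.feature >= 0 && r.feature < 24);
        // operator[] value-initializes a new profile, so all 24 features start at zero
        profiles[r.customer_id][static_cast<std::size_t>(r.feature)] += r.minutes;
    }
    for (auto& entry : profiles) {
        for (double& v : entry.second) {
            v = std::log1p(v);  // assumed variant: log(1 + minutes)
        }
    }
    return profiles;
}
```

Each map entry then corresponds to one row of the pivoted table, consisting of the customer id and the 24 transformed values.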



4.2 Clustering and Interpretation 

Using the well-prepared and preprocessed data, the task of building various customer
segments with their characteristic communication behavior could be addressed. The
well-known and proven k-means clustering method (see [6]) was selected and
applied to the data. The number of clusters k was set to 10. To determine the number k,
we executed a hierarchical clustering algorithm on a sample set (cf. [8]). The absolute
number of elements in the 10 clusters varies from 109 to 111 private customers. Analyzing
the smallest cluster revealed that these private customers were outliers.
Their calling minutes were significantly above the average communication behavior of
private customers and looked more like those of small enterprises. The remaining nine
clusters yielded a concentration of private customers with rather homogeneous
communication behavior within each cluster. Indeed, the clusters found formed a very good
basis for segmenting different groups of private customers. The average profile of each
cluster (see left side of figure 3) could be visualized by using so-called error diagrams.
For each of the 24 communication features the mean value and a 95% confidence interval
are calculated and represented for each cluster. By using this simple visualization
technique, very interesting differences between the clusters could be detected concerning
the usage of special communication features.
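The values shown in such an error diagram could be computed per feature and cluster as in the following sketch. It assumes a normal approximation with the factor 1.96 for the 95% interval; how the interval was actually computed is not stated in the text, and the names Interval and mean_ci95 are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Interval { double mean; double lower; double upper; };

// Mean and approximate 95% confidence interval of one communication feature
// within one cluster (normal approximation, assumed for illustration).
Interval mean_ci95(const std::vector<double>& x) {
    assert(x.size() >= 2);
    const double n = static_cast<double>(x.size());
    double sum = 0.0;
    for (double v : x) sum += v;
    const double mean = sum / n;
    double ss = 0.0;  // sum of squared deviations from the mean
    for (double v : x) ss += (v - mean) * (v - mean);
    const double half = 1.96 * std::sqrt(ss / (n - 1.0) / n);  // 1.96 * standard error
    return {mean, mean - half, mean + half};
}
```

Calling this function once for each of the 24 features of a cluster yields the points and error bars of one cluster profile.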

On the right-hand side of figure 3 a cluster with 777 members is represented. The
communication behavior of the cluster members significantly differs from the average
communication behavior of all 4500 panel customers. This cluster is very similar in city
calls on weekdays and weekends compared to the average of all customers. However,
all other communication features are obviously different (see figure 3).

Figure 3. Communication profiles: all customers (left) and one cluster (right)

As mentioned earlier in this section the PAS data mart contains additional social- 
demographic data derived from the customer inquiries. These include, for example, 
information concerning the size of the household or the net income of the customer. 
The demographic data can be accessed by an OLAP tool like the communication data 
and is added in form of columns to the table described in section 4.1 such that it can be 
used for further analysis. It is intended to use this information for generating intensional
rules that better explain each derived cluster.

5 Related Work 

Few methodologies have been proposed for the preprocessing task. In [4] it is described
how recommendations for preprocessing steps can be derived for classification tasks
based on the calculation of data characteristics. Calculating data characteristics on
complex and very large data sets is very difficult and in some cases impractical, and it
relies on statistical assumptions that do not hold in real-world domains.

The relationship between OLAP and data mining is examined by Parsaye (see [9]). 
He describes an architecture that combines OLAP and data mining applications and 
shows how data mining applications depend on different aggregation levels. Han’s 
group (see [7]) focuses its research on the area of OLAP mining. One of the motivations 
for OLAP mining was the perception that different aggregation levels were necessary 
for their pattern analysis. Both approaches concentrate on data mining algorithms but
hardly address other parts of the knowledge discovery process.

6 Conclusion & Further Work

In this paper we presented an approach for efficient and scalable data understanding and 
preprocessing in real world domains. Using OLAP technologies for preprocessing
offers quite a lot of advantages: OLAP applications are usually interactive and can handle
large data sets very well.

The approach presented in this paper aims at simplifying the problem of preprocessing 
real world data based on an existing OLAP environment. The preprocessing steps 
performed within the OLAP environment lead to well prepared target data for 
the following clustering task. That way we were able to derive interesting clusters. The 
communication profiles derived in the case study are based on the assumption that 24 
communication features are suited to describe the customer behavior. Existing works 
that build user profiles (e.g. [2], [5]) in a similar way make pretty much the same as- 
sumptions, but use different aggregation levels. Applying our architecture for OLAP- 
based preprocessing we are able to reconstruct different aggregation levels in a very 
efficient and easy way. The determination of the right aggregation level is a nontrivial 
task. In our further research we intend to investigate the influence of different aggrega- 
tion levels on clustering algorithms and on other data mining methods. 

Acknowledgements The work presented in this paper was partially financed by an 
Abstract. Metaquery (also known as metapattern) is a datamining tool 
useful for learning rules involving more than one relation in the database. 
A metaquery is a template, or a second-order proposition in a language 
that describes the type of pattern to be discovered. In an earlier paper we 
discussed the efficient computation of support for Meta-queries. In this 
paper we extend this work by comparing several support computation 
techniques. We also give real-life examples of meaningful rules which were 
derived by our method, and discuss briefly the software environment in 
which the meta-queries were run (the FLEXIMINE system). Finally we 
compare Meta-queries to Association rules and discuss their differences. 



1 Introduction 

With the tremendous growth in information around us, datamining is emerging 
as a vital research area among the AI and Databases communities [6]. Metaque- 
rying [8, 9] is a very promising approach for datamining in relational or deductive 
databases. Metaqueries serve as a generic description of the class of patterns to 
be discovered and help guide the process of data analysis and pattern generation. 
Unlike many other discovery systems, patterns discovered using metaqueries can 
link information from many tables in databases. These patterns are all relational, 
while most machine-learning systems can only learn propositional patterns. 

Metaqueries can be specified by human experts or alternatively, they can be 
automatically generated from the database schema. Either way, they serve as a 
very important interface between human “discoverers” and the discovery system. 
A metaquery R has the form 

$T \leftarrow L_1, \ldots, L_m$ (1) 

where $T$ and $L_i$ are literal schemas. Each literal schema $L_i$ has the form $Q_i(Y_1, \ldots, Y_{n_i})$, 
where all non-predicate variables $Y_j$ are implicitly universally quantified. The expression 
$Q_i(Y_1, \ldots, Y_{n_i})$ is called a relation pattern. The right-hand side $L_1, \ldots, L_m$ 
is called the body of the metaquery, and $T$ is called the head of the metaquery. 
The predicate variable $Q_i$ can be instantiated only to a predicate symbol of the 
specified arity $n_i$. (Note, however, that in our real-life experiments we allowed the 
projection of a relation with larger arity into such a predicate.) The instantiation 
must be done in a way which is consistent with the variable names. 



student-course 
student      course 
Ron          Calculus1 
Dana         ... 
...          Intro to CS 

course-department 
course       department 
...          mathematics 
Algebra      mathematics 
Intro to CS  computer science 

student-department 
student      department 
Ron          mathematics 
Dana         mathematics 
Ana          computer science 


Fig. 1. The relations student-course, course-department, and student-department 



For example, suppose that you have a database with the relations depicted in 
Figure 1. Let P, Q, and R be variables for predicates; then the metaquery 

R(X, Z) ← P(X, Y), Q(Y, Z) (2) 

specifies that the patterns to be discovered are transitivity relations 
r(X, Z) ← p(X, Y), q(Y, Z), where p, q, and r are specific predicates. One possible 
result of this metaquery on the database in Figure 1 is the pattern 

stud-dep(X, Z) ← stud-course(X, Y), course-dep(Y, Z) (3) 

which means intuitively that if a student takes a course from a certain department, 
then he must be a student of that department. Note that some or all of 
the attributes may be instantiated, e.g. the rule 

stud-dep(X, math) ← stud-course(X, Y), course-dep(Y, math) 

means that the above statement is true for math students. 

It turns out that the notion of Meta-query is related to the concept of Associ- 
ation rules [1]. It is also related to Schema query languages such as SchemaSQL. 
The discussion section will discuss these relationships. 

1.1 The notion of support and confidence for metaqueries 

Suppose that we are given the metaquery (2) again, but instead of the relations 
shown in Figure 1, we have added the corresponding rows and columns shown 
in Figure 2 (the relation course-department is the same as in Figure 1). 
Fig. 2. The new relations student-course and student-department 



The rule (3) no longer holds in all cases. But we can say that it holds in 
75% of the cases, or in other words, that it has confidence 0.75. 

On the other hand, let us consider the Meta-query: 

R(X, Y, W) ← P(X, Z), Q(Y, Z) (4) 

One of its instantiated rules may be: 

stud-dep(X, Z, sex = 'F') ← stud-course(X, Y), course-dep(Y, Z) (5) 

and this rule has a confidence of 100%, even though it involves only one half of 
one of the relations. This of course motivates a notion of support similar to 
that notion in association rules [1]. 

Hence, each answer to a metaquery is a rule accompanied by two numbers: 
the support and the confidence. The threshold for the support and confidence is 
provided by the user. Intuitively, the support indicates how frequently the body 
of the rule is satisfied, and the confidence indicates what fraction of the tuples 
which satisfy the body also satisfy the head. Similar to the case of association 
rules, the notions of support and confidence have two purposes: to avoid present- 
ing negligible information to the user and to cut off the search space by early 
detection of low support and confidence. Formally, given a rule 

$t \leftarrow r_1, \ldots, r_m$, (6) 

let $J$ denote the relation which is the equijoin of $r_1, \ldots, r_m$, and let $Jt$ be the 
relation which is the equijoin of $J$ and $t$. Where $y$ and $x$ are some relations, let 
$y_x$ be the projection of $y$ over the fields which are common to $y$ and $x$, and let 
$|x|$ be the number of tuples in $x$. For each $i$, $i = 1 \ldots m$, define $S_i$ to be the fraction 
of $r_i$ that appears in $J$, that is, $S_i = |J_{r_i}| / |r_i|$. 
In [2] we first presented the following definition for support of meta-queries: 
The support of the rule (6) is the maximum over $S_i$, $i = 1 \ldots m$. Less formally, the 
support is the maximum fraction of any relation $r_i$ in $J$. The confidence of (6) 
is the fraction of $t$ that appears in $J$. 
The support that we have defined has the following useful property: 
Claim. For any two relations $r_i$ and $r_j$ in the rule (6) that have at least one 
common attribute variable, $S_i$ is not bigger than the fraction of $r_i$ that participates 
in the equijoin between $r_i$ and $r_j$. 

This property enables us to get an upper bound on the support of the rule (6) 
by performing pairwise equijoins instead of equijoin of all the relations in the 
body of the rule. In [2] we compared our definition of support to that of Shen 
et al. [9], and showed that it is more general and intuitive. 
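
The definitions above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the relation contents are hypothetical toy data, the helper names (`natural_join`, `project`, `support_and_confidence`) are invented for illustration, and confidence is computed according to the intuitive description given earlier (the fraction of the tuples satisfying the body that also satisfy the head).

```python
from itertools import product

def natural_join(r, s):
    """Equijoin of two relations (lists of dicts keyed by variable name)."""
    out = []
    for a, b in product(r, s):
        if all(a[v] == b[v] for v in set(a) & set(b)):
            out.append({**a, **b})
    return out

def project(rel, variables):
    """Projection of a relation onto the given variables (duplicates removed)."""
    return {tuple(row[v] for v in variables) for row in rel}

def support_and_confidence(head, body):
    """Support = max_i S_i; confidence = share of body tuples that satisfy the head.

    Assumes every relation in the body is non-empty.
    """
    J = body[0]
    for r in body[1:]:
        J = natural_join(J, r)
    # S_i: fraction of r_i that appears in the join J
    fractions = []
    for r in body:
        vars_r = sorted(r[0])
        fractions.append(len(project(J, vars_r)) / len(project(r, vars_r)))
    support = max(fractions)
    if not J:
        return support, 0.0
    body_vars = sorted(J[0])
    Jt = natural_join(J, head)
    confidence = len(project(Jt, body_vars)) / len(project(J, body_vars))
    return support, confidence

# Hypothetical data for rule (3): stud-dep(X,Z) <- stud-course(X,Y), course-dep(Y,Z)
stud_course = [{"X": "Ron", "Y": "Calculus1"}, {"X": "Dana", "Y": "Algebra"},
               {"X": "Ana", "Y": "Intro to CS"}, {"X": "Ron", "Y": "Intro to CS"}]
course_dep = [{"Y": "Calculus1", "Z": "mathematics"}, {"Y": "Algebra", "Z": "mathematics"},
              {"Y": "Intro to CS", "Z": "computer science"}]
stud_dep = [{"X": "Ron", "Z": "mathematics"}, {"X": "Dana", "Z": "mathematics"},
            {"X": "Ana", "Z": "computer science"}]

support, confidence = support_and_confidence(stud_dep, [stud_course, course_dep])
print(support, confidence)
```

In this toy instance every tuple of both body relations takes part in the join, so the support is 1.0, while one of the four joined tuples (Ron taking a computer science course) contradicts the head, giving a confidence of 0.75.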

The main part of [2] was devoted to presenting efficient algorithms to compute 
support. We briefly review these algorithms here in Section 2, where we also 
compare the performance of two of the support computing algorithms, Brave 
and Cautious (Section 2.1). In Section 3 we present some examples of real-life rules derived 
by our Meta-query method and discuss its software environment (the FlexiMine 
system [5]). In Section 4 we compare Meta-queries to association rules and 
schema query languages, and we conclude in Section 5. 



2 Support computation algorithms 

How hard is answering a metaquery? In [2] this question was shown to be NP-hard 
in the number of attributes and relations. The naive algorithm is to try to 
instantiate each predicate and variable with all possible relations and attributes 
subject to the unification constraints, to compute the support for each such 
instantiation, and to compute the confidence for those instantiations which pass 
the support threshold. In practice, though, many such instantiations will have 
very low support, and detecting this low support early makes the above 
computation much more efficient; this is discussed next. 

The process of answering a metaquery can be divided into two stages. In 
the first stage, which we call the instantiation stage, we are looking for sets of 
relations that match the pattern determined by the metaquery. In the second 
stage, which we call the filtration stage, we filter out all rules that match the 
pattern of the metaquery but do not have enough support and confidence. 

The process of instantiating a metaquery is similar to solving a Constraint 
Satisfaction problem (CSP) [4] where we are basically looking for all solutions 
of the CSP problem. In our experiments we used a very simple CSP algorithm 
(forward checking with Back-jumping [3]) but other more advanced algorithms 
may be used. The basic instantiation algorithm was presented in [2] and will not 
be shown here. 

The filtration stage itself is composed of two steps: filtering out rules with 
low support, and filtering out rules with low confidence. We compute confidence 
only for rules with sufficient support. In our research we have focused on the 
following algorithms for computing support: 

The Join approach The straightforward way: computing the equijoin of the 
body of the rule, then computing $S_i$ (as defined in section 1.1) for each 
relation in the body, and then taking the maximum. 

compute-support($r_1, \ldots, r_m$) 

Input: Set of relations $r_1, \ldots, r_m$, where each of the attributes 
is bound to a variable. A support threshold MinSupport. 

Output: if the rule whose body is $r_1, \ldots, r_m$ has support 
equal or larger than MinSupport, return the support, 
otherwise return -1. 

1. RelSetCopy = RelSet = $\{r_1, \ldots, r_m\}$; LowSupp = true; 
2. While (RelSet $\neq \emptyset$) and LowSupp do 
   1. Let $r \in$ RelSet; 
   2. s = $S_i$-upbound($r$, RelSetCopy); 
   3. if s $\geq$ MinSupport then LowSupp = false 
      else RelSet = RelSet $- \{r\}$; 
3. If LowSupp then return -1 
   else return Join-support($r_1, \ldots, r_m$) 



Fig. 3. computing support for a rule body 



The histogram approach Using histograms for estimating support. Comput- 
ing the exact support only for rules with high estimated support. 

The histogram + memory approach same as the histogram approach, ex- 
cept that we store intermediate results in memory, and reuse them when we 
are called to make the same computation. 

The procedure Join-support($r_1, \ldots, r_m$) computes the support $S_i$ of each 
relation by performing the Join as explained in section 1.1 and then returns 
$\max\{S_i \mid 1 \leq i \leq m\}$. The other two approaches compute an upper bound on the 
support and then compute the exact support only for rules with a high enough upper 
bound of support. The idea is summarized in Algorithm compute-support in 
Figure 3. Note that once one relation with high $S_i$ is found, we turn to computing 
the exact support using Join. 

The procedure $S_i$-upbound called by the algorithm compute-support returns 
an upper bound for the value $S_i$ for a single relation $r_i$ in the body of the rule. 
This can be done by one of two procedures: $S_i$-upbound-brave or $S_i$-upbound-cautious, 
shown in Figures 4 and 5, respectively. The basic idea is that an upper 
bound can be achieved by taking the join of a relation $r_i$ with any other relation 
with which $r_i$ has variables in common. Procedure $S_i$-upbound-brave does this 
by picking one arbitrary relation with which $r_i$ has a common variable, and 
procedure $S_i$-upbound-cautious does this by considering all relations with which 
$r_i$ has variables in common, and taking the minimum. Procedure $S_i$-upbound-cautious 
works harder than procedure $S_i$-upbound-brave, but it achieves a tighter 
upper bound and hence can save more Join computations. 

The procedure upbound called by procedure $S_i$-upbound-cautious or procedure 
$S_i$-upbound-brave uses either the histogram approach or the histogram+memory 
approach. The histogram approach exploits the fact that histograms 
are easy to construct and are quite useful for support estimation. 
The algorithms to compute the support using histograms (i.e. the procedure 
up-bound_histo) were also presented in [2] and will not be repeated here. 

The evaluation of the above support computation algorithms was also reported 
in [2]. The three strategies described above were compared and tested 
using only the procedure $S_i$-upbound-brave. The main conclusion, as expected, 
was that the two estimating methods were better than the Join method, especially 
when the support threshold was large. This is because in those cases they had 
a better opportunity to cut the search space. 



$S_i$-upbound-brave($r_i$, S) 

input: A relation $r_i$ and a set of relations S. 
output: An upper bound on $S_i$ for a rule whose body is 
$S \cup \{r_i\}$. 

1. s = 1.0 
2. If there is $r' \in S$ such that $r'$ and 
   $r_i$ have a common variable X then 
   s = upbound($r_i$, $r'$, att($r_i$, X), att($r'$, X)); 
3. return s; 



Fig. 4. computing Si bravely 



$S_i$-upbound-cautious($r_i$, S) 

input: A relation $r_i$ and a set of relations S. 
output: An upper bound on $S_i$ for a rule whose body is 
$S \cup \{r_i\}$. 

1. s = 1.0 
2. for each relation $r'$ in S such that $r_i$ and $r'$ have variables 
   in common 
   do for each common variable X of $r_i$ and $r'$ 
      do 
         s' = upbound($r_i$, $r'$, att($r_i$, X), att($r'$, X)); 
         if s' < s then s = s'; 
3. return s; 



Fig. 5. computing Si cautiously 
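
The interplay of Figures 3 to 5 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: relations are lists of dicts keyed by variable name and are assumed non-empty and duplicate-free, and the histogram-based `upbound` procedure of [2] is replaced here by a hypothetical exact stand-in, `pair_fraction`, which returns the fraction of tuples of one relation whose value for a shared variable also occurs in the other relation. The final threshold check in `compute_support` is added so that the function matches the Output description of Figure 3.

```python
def common_vars(r, s):
    """Variables shared by two non-empty relations."""
    return sorted(set(r[0]) & set(s[0]))

def pair_fraction(r, s, var):
    """Stand-in for upbound: share of r's tuples whose var value occurs in s."""
    values = {row[var] for row in s}
    return sum(row[var] in values for row in r) / len(r)

def upbound_brave(r, others):
    """Bound S_i using the first relation that shares a variable with r."""
    for s in others:
        shared = common_vars(r, s)
        if shared:
            return pair_fraction(r, s, shared[0])
    return 1.0

def upbound_cautious(r, others):
    """Bound S_i by the minimum over all relations and shared variables."""
    bound = 1.0
    for s in others:
        for var in common_vars(r, s):
            bound = min(bound, pair_fraction(r, s, var))
    return bound

def join_support(relations):
    """Exact support: join all relations, then take the maximum S_i."""
    joined = relations[0]
    for s in relations[1:]:
        joined = [{**a, **b} for a in joined for b in s
                  if all(a[v] == b[v] for v in set(a) & set(b))]
    fractions = []
    for r in relations:
        keys = sorted(r[0])
        in_join = {tuple(row[k] for k in keys) for row in joined}
        fractions.append(len(in_join) / len(r))
    return max(fractions)

def compute_support(relations, min_support, upbound):
    """Join only if some relation's upper bound reaches the threshold."""
    for r in relations:
        others = [s for s in relations if s is not r]
        if upbound(r, others) >= min_support:
            support = join_support(relations)
            return support if support >= min_support else -1
    return -1

# Hypothetical toy relations r1(X, Y), r2(Y, Z), r3(X, W)
r1 = [{"X": 1, "Y": "a"}, {"X": 2, "Y": "b"}, {"X": 3, "Y": "c"}, {"X": 4, "Y": "d"}]
r2 = [{"Y": "a", "Z": 10}, {"Y": "b", "Z": 20}, {"Y": "c", "Z": 30}, {"Y": "d", "Z": 40}]
r3 = [{"X": 1, "W": "p"}]
# A second body in which only one value of Y matches
r4 = [{"Y": "a", "Z": 10}, {"Y": "e", "Z": 20}, {"Y": "f", "Z": 30}, {"Y": "g", "Z": 40}]

print(upbound_brave(r1, [r2, r3]), upbound_cautious(r1, [r2, r3]))
print(compute_support([r1, r4], 0.5, upbound_brave))
```

For `r1`, the brave bound looks only at `r2` and reports 1.0, whereas the cautious bound also inspects `r3` and tightens the estimate to 0.25. For the body `[r1, r4]` both relations have an upper bound of 0.25, so with a threshold of 0.5 the rule is rejected without computing the full join.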
2.1 Comparing the Brave vs. Cautious 



As was mentioned above, in [2] we only used the Brave algorithm for computing 
support. We later extended the experiment and measured the performance of 
Brave vs. Cautious. 

In Figure 6 we see the performance of the Brave vs. the Cautious algorithms 
for two support thresholds. A "Miss" means that the upper bound 
computed by the algorithm was above the threshold while the actual support 
value was below the threshold. As can be seen from the graphs, the total number 
of misses for the first 20 queries is higher for Brave than for Cautious, and 
the difference is reduced as the support threshold is increased. The explanation 
may be that for low threshold values the algorithms are more sensitive to the 
mistakes they make. This behaviour is not consistent, though, for very high support 
thresholds, so it needs further investigation. In any case, it seems that if 
many instantiated queries "fail" because of low support (which is not the case 
reported here), the difference between the algorithms becomes less significant, 
and the Brave method should be satisfactory. 









Misses vs. query number (support threshold = 0.3) 
Fig. 6. Comparison of Brave vs. Cautious 

3 Meta-queries implementation in FlexiMine 

FlexiMine [5] is a prototype KDD system currently being developed at Ben-Gurion 
University for testing techniques and algorithms for Data Mining and 
Knowledge Discovery and for evaluating them in the context of real-life databases 
and users. The emphasis of this system is on the integration of most KDD operations 
(including database access and selection, preprocessing and abstraction, and data-mining 
algorithms such as Association rules, Bayesian knowledge-bases, and Decision-trees), 
and on extensibility. Thus, the system facilitates the incorporation of new 
algorithms or their improved variants, and the convenient extension of support to 
new databases or abstraction hierarchies. 

Usually a user of FlexiMine is presented with a screen containing the schema 
of the selected relational database. The user then goes through a process of defining 
the specific set of attributes on which she wants to perform data-mining algorithms. 
The result of this selection is a flat file of records, since algorithms such 
as Association rules or Decision-trees work on flat files. As part of this process, the 
user may apply the process of Abstraction. In real-life databases there may 
be attributes with continuous numeric values. These attributes need to be abstracted 
into coarse-grained categories in order for the particular mining algorithm 
to succeed in finding items with high enough support. This process was described 
in detail in [5]. 

The Meta-queries algorithm differs from other data-mining algorithms in 
that it accesses the original database directly without going through a selection 
process. This is obviously needed since Meta-queries make heavy use of the Schema 
information. On the other hand, the Abstraction process is still important. There are 
two ways of handling abstractions. Some fields have default abstractions, 
which are used by the Meta-queries. Other fields may be instantiated during 
the process and may not have any linked abstraction. Such fields are abstracted 
during the instantiation process using heuristics which take into account 
the number of distinct values in the domain. 

The metaquery module is implemented in SQL embedded in C++. At present, 
we use syntax similar to Prolog or Datalog. Since, in our system, it is possible 
to replace one of the relation or field variables with a constant (in order to 
specify a pattern involving a specific field or a specific relation), such a constant 
is enclosed in quotes. For example, in the following Meta-query: 

"Course_Student"(X, Y, "Grade"), "Students"(X, W, "Age") -> A(Y, Z, U) 

we are looking for rules connecting grade results and students' age to the 
name of the course (e.g. a possible rule is: young students are likely to get 80 or 
higher in Intro to CS...). 

We tested the Meta-query algorithm on several databases. One of these 
databases was a hospitalization database of about 600,000 records. Following 
are some interesting instances of rules we received from this database. The Meta-query 
was as follows: 

"diagnosis"(X, "name"), "hospital_record"(X, Y,) -> Z(X, U, V) 

which means: Find relationships between diagnosis and name of disease, the 
hospital record, and any other relation in the database on the right side. We 
also used the following abstraction on Age, which is a discrete numeric value: 
'young' means an age of 1-25, 'adult' an age of 26-54, and 'old' an age of 55-95. 

1. diagnosis(patient-id, name='Psychiatric-disorders'), 
   hospital_record(patient-id, a, reason='birth') 
   -> personal(patient-id, age-group='adult', sex='female') 
   Support=0.025 Confidence=0.790 

2. diagnosis(patient-id, name='Neurologic-disease'), 
   hospital_record(patient-id, a, releasestatus='death') 
   -> personal(patient-id, age-group='old', sex='male') 
   Support=0.002 Confidence=0.344 

3. diagnosis(patient-id, name='Neurologic-disease'), 
   hospital_record(patient-id, a, releasestatus='death') 
   -> personal(patient-id, age-group='old', sex='female') 
   Support=0.002 Confidence=0.281 

4. diagnosis(patient-id, name='Immunologic-disease'), 
   hospital_record(patient-id, a, releasestatus='healthy') 
   -> personal(patient-id, age-group='young', maritalstatus='single') 
   Support=0.004 Confidence=0.302 

The first instance is quite obvious and often occurs in Association rules: 79% 
of the patients who were diagnosed with psychiatric disorders and were admitted because 
they gave birth were women of the age-group 'adult'. The second and third 
instances show a slight difference between females and males for death from 
the same (hospitalization) reason. The fourth instance shows other attributes 
that were instantiated by the Meta-query algorithm. 

Note also that the support values for the above queries are quite low. If we 
had abstracted the Diagnosis disease codes into higher level diseases, we 
would probably have obtained higher values of support. This shows the importance 
of Abstraction in the Meta-query system, and we are currently implementing 
such abstraction based on the ICD9 clinical coding scheme. 

4 Discussion 

4.1 Association rules vs. Meta-queries 

From the previous discussion and examples it seems that there is a large sim- 
ilarity between Meta-queries and Association rules. Yet, although in principle 
the same rules or knowledge may be derived by both (or by Decision trees and 
other methods), there are still several significant differences. 

The first issue is the use of schema information. Rules from Meta-queries 
include the Relation names in the rule. In most cases these names have an 
important meaning which is directly observable by looking at the rule, while in 
the case of Association rules, one has to remember where the various items came 
from and what they mean. Furthermore, if the database contains many 
views which represent particular "slices" of the database, one can find interesting 
queries which involve such views (with high support). 




274 R. Ben-Eliyahu-Zohary and E. Gudes 



A second issue is that Meta-queries operate on the database itself, while 
association rules require a phase of Selection and Pre-processing. The result of 
this phase must be a well organized flat file with well defined transactions. This 
phase, usually ignored by most researchers, is quite painful and not at all 
obvious. The main problem here is the problem of Normalization. Assume a 
common database of students and courses with three relations: 

Students(Student_Id, Name, ...), Course(Course_Id, Name, ...) 
Student_Course(Course_Id, Student_Id, Grade, ...) 

If one joins the relations Students and Student_Course and looks for 
rules involving attributes such as Age or Sex and Grade, one will get wrong 
results, since Age and Sex values will be biased towards students taking many 
courses. The solution is of course Denormalization and the construction of 
Transactions, where each transaction corresponds to one student and course grades 
are reported as individual new fields (or as their maximum or average). This 
process is supported directly in FlexiMine, where it is called Aliasing [5]. The 
process is not automatic, though: the user must be aware of it and must direct 
the system as to what type of Denormalization or Aliasing to perform. Such a process 
is unnecessary with Meta-queries, since the Join structure is preserved. 
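
The bias introduced by the join can be seen in a small Python sketch. The data and variable names below are hypothetical and only serve to illustrate the argument; the denormalization step is a simplified stand-in for what the text calls Aliasing, not the FlexiMine implementation.

```python
# Hypothetical toy data: one female student takes three courses,
# two male students take one course each.
students = {"s1": "F", "s2": "M", "s3": "M"}          # Student_Id -> Sex
student_course = [                                     # (Course_Id, Student_Id, Grade)
    ("c1", "s1", 90), ("c2", "s1", 85), ("c3", "s1", 88),
    ("c1", "s2", 70), ("c1", "s3", 75),
]

# Share of female students, counted once per student.
per_student = sum(sex == "F" for sex in students.values()) / len(students)

# Share of "female" rows after joining Students with Student_Course:
# the student with many courses is counted once per course.
per_row = sum(students[sid] == "F" for _, sid, _ in student_course) / len(student_course)

# Denormalization: one transaction per student, grades aggregated into new fields.
transactions = {}
for _, sid, grade in student_course:
    record = transactions.setdefault(sid, {"sex": students[sid], "grades": []})
    record["grades"].append(grade)
for record in transactions.values():
    record["max_grade"] = max(record["grades"])
    record["avg_grade"] = sum(record["grades"]) / len(record["grades"])

print(per_student, per_row, len(transactions))
```

Counting per student, one third of the students are female, but in the joined relation three of the five rows belong to the female student, so any frequency computed on the joined rows overstates her weight. After denormalization there is again exactly one record per student.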

Finally, meta-queries can discover more "general" knowledge than Association 
rules. Consider the Meta-query (2) in Section 1, and the instance (3). Such 
a rule cannot be discovered directly by an Association-rule algorithm, because 
it involves a constraint on a set of values (in some sense, the resulting rule is an 
Integrity constraint). One can define during the pre-processing stage a new 
attribute which gets the value 1 each time the join row of a student and 
a course he takes shows equality of the student and course departments. 
But in order to do that, one has to know the desired result during the 
pre-processing stage, clearly an unrealistic assumption. Such pre-knowledge is not 
required at all by the Meta-query method. For all these reasons, meta-queries 
are useful along with Association rules for Knowledge discovery. 

4.2 Comparison with Schema query languages 

Recently, query languages have been developed that range over the schema as well as 
over the content of a database. We briefly compare Meta-queries to one such 
language, SchemaSQL [7]. The motivation for such languages is the querying 
of Web data and the integration of heterogeneous databases. One must be able 
to query the schema in those cases. The similarity with Meta-queries is that in 
both, variables can "run" over relation and attribute names. There are, however, 
significant differences because of the different motivation. First, the result of a 
SchemaSQL query is a single relation (or set of relations), while the result of 
a Meta-query is a set of instantiated queries, each of which is then evaluated separately 
against the original database. Second, the notions of support and confidence are 
essential for Meta-queries and help to cut the search space, as well as to give only 
relevant results. These concepts are completely missing from SchemaSQL queries. 
Third, some of the operators of SchemaSQL (e.g. split) are not relevant to 
Meta-queries, since they may create new relations from tuples (values) with 
very low support. On the other hand, other operators (e.g. unfold) may be useful 
to solve the Denormalization problem mentioned above for Association rules. 
Finally, some of the optimization techniques used by SchemaSQL may be useful 
for Meta-queries too, but this requires further investigation. 



5 Conclusion 

This paper extends previous research on metaqueries in several ways. We first 
review our notion of support for a rule generated according to a pattern, and we 
present efficient algorithms for computing support. We then compare the perfor- 
mance of two algorithms for computing support bounds: the Brave and Cautious 
algorithms. Then we discuss the implementation of Meta-queries within the 
FlexiMine system framework. We also present some results of rules generated 
from meta-queries on a real-life database. Finally, we compare Meta-queries to 
Association rules and to Schema query languages and discuss their corresponding 
advantages. Future work will both extend the power and syntax of the Meta- 
query system, and will investigate their optimization and performance under 
various conditions. 
Abstract. Iceberg queries compute aggregate functions over an 
attribute (or set of attributes) to find aggregate values above some specified 
threshold. These queries are difficult to execute because the number 
of unique data is greater than the number of counter buckets in memory. 
Previous research, however, did not consider average functions 
among the aggregate functions. In order to compute 
average iceberg queries efficiently, we introduce a theorem for selecting 
candidates by means of partitioning, and propose the POP algorithm based 
on it. The characteristics of this algorithm are that it partitions a relation 
logically and, in order to use memory efficiently, postpones partitioning until all 
buckets are occupied with candidates. Experiments show that the proposed 
algorithm is affected by memory size, data order, and the distribution of 
the data set. 



1 Introduction 

Recent research [1][3] has paid attention to the iceberg problem. The iceberg 
problem in databases means that the relation between a lot of data and a few results is 
similar to that between an iceberg and its tip. The Iceberg CUBE problem 
was introduced in [3]. This problem is to compute only those group-by partitions 
with an aggregate value above some minimum support threshold. Iceberg 
queries were introduced in [1]. These queries have four characteristics: (1) computing 
aggregate functions (2) over large data (3) whose domain size (the 
number of unique data) is greater than the number of counter buckets, and (4) 
returning results above a threshold. 

Such queries have to be computed in the following cases. One is when the 
amount of data is very large, as in data warehousing [1][3], market basket analysis 
[4][8][9] in data mining, clustering [10][11] and so on. Another is when the sizes 
of the data items are large, as in multimedia databases [5][6]. The third is when memory is 
limited, as in an embedded system [12]. However, average functions have not been 
considered until now. This is because the algorithms proposed in [1] are based on 
coarse counting principles [7][8], and the computation of averages by these principles 
generates false negatives (some results are missed). We discuss 
this matter further in section 2. 

^ This work was supported by the Brain Korea 21 Project. 

Y. Kambayashi, M. Mohania, and A M. Tjoa (Eds.): DaWaK 2000, LNCS 1874, pp. 276-286, 2000. 

© Springer- Verlag Berlin Heidelberg 2000 




Partitioning Algorithms for the Computation of Average Iceberg Queries 277 



SELECT target1, target2, ..., targetn, avg(rest) 
FROM R 
GROUP BY target1, target2, ..., targetn 
HAVING avg(rest) >= T 



nationality   job         income 
(target1)     (target2)   (rest) 
Korea         professor   1000 
America       manager     1200 
China         doctor      2000 
Korea         professor   1500 


(a) Average Iceberg Query (AIQ) (b) relation R 

Fig. 1. Average Iceberg Query and relation R 



In this paper, we focus only on average iceberg queries (AIQ). An example of 
an AIQ is shown in Fig. 1, based on a relation R(target1, target2, ..., targetn, rest) 
and a threshold T. In this relation, the rest attribute (income) has to be numeric. 
For the computation of AIQ, we introduce a theorem for selecting candidates 
by means of partitioning, explain the BAP algorithm based on it, and improve BAP 
into the POP algorithm. The idea of partitioning is similar to that proposed in [4] 
for mining association rules. The difference is that in [4] the amount of data in each 
partition is the same, whereas in this paper the number of unique data in each partition is the same. 

The rest of this paper is structured as follows. In section 2 we discuss the 
limitation in related work. In section 3 and 4 we introduce and prove the theorem 
to select candidates, and present partitioning algorithms based on this theorem. 
In section 5 we evaluate our algorithms. We conclude in section 6. 

2 Related Work 

Iceberg CUBE problem introduced in [3] is to compute only those group-by 
partitions with an aggregate value above some minimum support threshold. The 
basic CUBE problem is to compute all of the aggregates as efficiently as possible. 
The chief difficulty is that the CUBE problem is exponential in the number of 
dimensions: for d dimensions, $2^d$ group-bys are computed. So, [3] introduced the 
iceberg CUBE problem and used the pruning technique similar to the pruning 
in Apriori[9]. For example, if <a3, *> and <*, b2> made minimum support, 
then <a3, b2> would be added to the candidate set. If <a3, *> met minimum 
support but <*, b2> didn’t, then <a3, b2> can be pruned. 

The iceberg queries problem was introduced in [1], which proposed algorithms based 
on coarse counting. The most important feature is the use of a hashing scan in coarse 
counting. The simplest form of this technique uses an array A[1..m] as counter 
buckets and a hash function h. In the first scan, each target is hashed and the 
counter bucket for it is incremented by one. After the hashing scan, targets in buckets 
above the specified threshold become candidates. The next scan is performed only over the 
candidates. This method can be extended to sum functions by incrementing the 
counter not by one, but by the value. 

However, neither the pruning technique proposed in Apriori [9] nor the coarse counting 
principle can be applied to average functions. If they are applied to average 

Average Threshold = 3.5 

relation 
professor   5 
manager     1 
doctor      4 
professor   3 

h(professor) = 1 
h(manager) = 1 
h(doctor) = 2 

counter after coarse counting 
hash value   sum   count 
1            9     3 
2            4     1 


Fig. 2. A False Negative Example 



functions, a false negative example like Fig. 2 will happen, {professor} meets 
threshold 3.5 because of sum=8 and count=2, but {professor} is omitted from 
candidates. It’s because average is 3 (sum=9 and count=3) in the hash bucket 1 
with {professor} and {manager}. That is, {manager} below threshold prevents 
{professor} from being included in candidates. This matter was discussed in [2], 
too. 
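
The false negative of Fig. 2 can be reproduced with a few lines of Python. This
is only an illustrative sketch: the function names are hypothetical, the hash
function is passed in as a lookup table, and averaging per hash bucket is the
naive extension of coarse counting that the text argues against.

```python
from collections import defaultdict

def coarse_count_avg_candidates(tuples, h, threshold):
    """Naive coarse counting for averages: a target becomes a candidate
    only if the average of its whole hash bucket meets the threshold."""
    bucket_sum = defaultdict(float)
    bucket_cnt = defaultdict(int)
    for target, rest in tuples:
        bucket_sum[h(target)] += rest
        bucket_cnt[h(target)] += 1
    return {t for t, _ in tuples
            if bucket_sum[h(t)] / bucket_cnt[h(t)] >= threshold}

def exact_avg_results(tuples, threshold):
    """Exact per-target averages, used as the ground truth."""
    s = defaultdict(float)
    c = defaultdict(int)
    for target, rest in tuples:
        s[target] += rest
        c[target] += 1
    return {t for t in s if s[t] / c[t] >= threshold}

# Data and hash values taken from Fig. 2.
relation = [("professor", 5), ("manager", 1), ("doctor", 4), ("professor", 3)]
hash_of = {"professor": 1, "manager": 1, "doctor": 2}

candidates = coarse_count_avg_candidates(relation, hash_of.get, 3.5)
answers = exact_avg_results(relation, 3.5)
missed = answers - candidates   # targets that qualify but were never candidates
```

Here `missed` contains {professor}: the bucket shared with {manager} averages 3,
so a target whose own average is 4 is dropped.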



3 Selection of Candidates 

In this section, we define notations and prove the theorem to select candidates. 



3.1 Notations 



For simplicity, we present our algorithms in the next section in the context of a
relation R of <target, rest> pairs. We assume for now that we are executing a
simple average iceberg query that groups on a single target in R, as opposed
to a set of targets. Our algorithms can be easily extended to multiple target
sets.

A partition $p \subseteq R$ refers to any subset of the tuples contained in R.
Any two different partitions are non-overlapping, i.e., $p_i \cap p_j = \emptyset$
for $i \neq j$, and the union of all partitions is the relation, i.e.,
$p_1 \cup p_2 \cup \dots \cup p_n = R$. Each bucket in counter C consists of
<target, sum_val, cnt_val>, where sum_val is the sum of the rest values of the
tuples with the same target and cnt_val is the number of tuples with the same
target. |C| is defined as the number of all buckets in counter C.

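Functions in this paper are written in the form function-name(target, range).
Given a target t, a partition p, and a relation R consisting of n partitions,
count(t, p) is defined as the number of tuples within p whose target is t. The
other functions are defined as follows, where $\mathit{rest}_j$ denotes the rest
value of the j-th tuple with target t in the partition being summed over.

$$\mathit{sum}(t,p) = \sum_{j=1}^{\mathit{count}(t,p)} \mathit{rest}_j, \qquad
  \mathit{avg}(t,p) = \frac{\sum_{j=1}^{\mathit{count}(t,p)} \mathit{rest}_j}{\mathit{count}(t,p)}$$

$$\mathit{count}(t,R) = \sum_{i=1}^{n} \mathit{count}(t,p_i)$$

$$\mathit{sum}(t,R) = \sum_{i=1}^{n} \mathit{sum}(t,p_i)
  = \sum_{i=1}^{n} \sum_{j=1}^{\mathit{count}(t,p_i)} \mathit{rest}_j$$

$$\mathit{avg}(t,R) = \frac{\mathit{sum}(t,R)}{\mathit{count}(t,R)}
  = \frac{\sum_{i=1}^{n} \sum_{j=1}^{\mathit{count}(t,p_i)} \mathit{rest}_j}{\sum_{i=1}^{n} \mathit{count}(t,p_i)}$$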



3.2 Theorem for Selecting Candidates 

Theorem 1. Suppose the average threshold T is given and relation R is
partitioned into n partitions $(p_1, p_2, \dots, p_n)$. Then, for any target t,

$$\mathit{avg}(t,R) \ge T \;\Rightarrow\; \exists i \;\; \mathit{avg}(t,p_i) \ge T, \quad i = 1, 2, \dots, n.$$

The converse proposition does not hold.



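Proof. Suppose, for contradiction, that there is a target t whose average value
meets T in R but does not meet T in any partition. From this assumption, we have

$$\forall i \quad \mathit{avg}(t,p_i) = \frac{\sum_{j=1}^{\mathit{count}(t,p_i)} \mathit{rest}_j}{\mathit{count}(t,p_i)} < T, \quad i = 1, 2, \dots, n. \tag{1}$$

Multiplying both sides of (1) by count(t, p_i) yields

$$\forall i \quad \sum_{j=1}^{\mathit{count}(t,p_i)} \mathit{rest}_j < T \cdot \mathit{count}(t,p_i), \quad i = 1, 2, \dots, n. \tag{2}$$

Replacing the numerator of avg(t, R) with (2), we obtain

$$\mathit{avg}(t,R) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{\mathit{count}(t,p_i)} \mathit{rest}_j}{\sum_{i=1}^{n} \mathit{count}(t,p_i)}
  < \frac{\sum_{i=1}^{n} T \cdot \mathit{count}(t,p_i)}{\sum_{i=1}^{n} \mathit{count}(t,p_i)} = T. \tag{3}$$

This contradicts the assumption that avg(t, R) meets T. Therefore, if a target
is above or equal to the threshold in a relation, it must be above or equal to
the threshold in at least one partition p.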



Theorem 1 is similar to the pigeonhole principle. According to Theorem 1, we can
prune targets that are below the threshold in all partitions. That is, if
avg(t, p) meets T in some partition p, t is included in the candidates. Notice
that this theorem holds regardless of how the relation is partitioned. In the
next section, we discuss how to partition a relation.
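
A small numeric check may make Theorem 1 and its failed converse concrete. The
sketch below uses made-up data and hypothetical helper names; it collects every
target whose average meets T in at least one partition and compares that set
with the true answers over the whole relation.

```python
from collections import defaultdict

def averages(tuples):
    """Return {target: average of rest} for a list of (target, rest) pairs."""
    s = defaultdict(float)
    c = defaultdict(int)
    for target, rest in tuples:
        s[target] += rest
        c[target] += 1
    return {t: s[t] / c[t] for t in s}

def candidates_from_partitions(partitions, threshold):
    """Union of the targets that meet the threshold in some partition."""
    cands = set()
    for p in partitions:
        cands |= {t for t, a in averages(p).items() if a >= threshold}
    return cands

# Hypothetical relation split into three partitions.
partitions = [
    [("a", 10), ("b", 2)],
    [("a", 2), ("b", 6)],
    [("c", 1), ("a", 6)],
]
T = 5
relation = [row for p in partitions for row in p]

answers = {t for t, a in averages(relation).items() if a >= T}
cands = candidates_from_partitions(partitions, T)
```

Target a (average 6 in R) is an answer and is also a candidate, as the theorem
requires. Target b averages 4 in R but 6 in the second partition, so it is a
candidate without being an answer, which illustrates why the converse fails.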



4 Partitioning Algorithms 

In this section, we introduce the BAP (BAsic Partitioning) algorithm and propose
the POP (POstponed Partitioning) algorithm, which improves on BAP. These
algorithms differ only in the Gen_Partition procedure.

4.1 Common Parts of Algorithms BAP and POP

The basic idea of algorithms BAP and POP is to partition a relation logically to
find candidates. These algorithms consist of two phases: the first partitions
the relation and selects candidates, and the second computes the exact average
values of the candidates. The common parts of the partitioning algorithms are
described as follows.

Algorithm Partitioning
input : Relation R, Threshold T
begin
    // Phase 1 : Partitioning and Selecting Candidates
    second_scan := false;
    bucket_num := 0;
    for all tuple d in R do
        if there is the bucket for d.target in Counter C then
            Update(C,d);
        else
            if bucket_num < |C| then
                Insert(C,d);
                bucket_num++;
            else
                Gen_Partition(C,T,bucket_num);
                second_scan := true;
            end if
        end if
    end for

    // Phase 2 : Computing the Exact Value of Candidates
    if second_scan == false then
        Print_Results(C,T);
    else
        repeat
            read as many candidates as counter buckets to C;
            scan R and update C;
            Print_Results(C,T);
        until there are no more candidates;
    end if
end

In phase 1, a tuple d is read from relation R. The treatment of d varies
according to whether or not the counter bucket for d.target exists and whether
or not there are any empty buckets in the counter. If the bucket for d.target
exists, sum_val and cnt_val in the bucket are updated. If the bucket for
d.target does not exist but there are some empty buckets, the bucket for
d.target is generated. If neither the bucket for d.target nor any empty bucket
exists, Gen_Partition() is called. This procedure performs partitioning, which
means selecting candidates, writing them, and resetting the counter. However,
Gen_Partition() differs between BAP and POP; it is discussed in more detail in
Sections 4.2 and 4.3. These steps are repeated until R ends.

Phase 2 computes the exact average values of the candidates. If the number of
partitions is one, all results can be returned from the counter immediately.
Otherwise, candidates are loaded into memory and R is scanned again to compute
sum_val and cnt_val of the candidates. If there are more candidates than
buckets, these steps are repeated, and each scan returns part of the results.



4.2 Algorithm BAP 

Procedure Gen_Partition. When an empty bucket is needed but does not exist,
this procedure in BAP performs partitioning, which means selecting candidates,
writing them, and resetting the counter. According to Theorem 1, targets whose
average values satisfy the given threshold in a partition are selected as
candidates. These are written to disk, and the counter is reset. The details
are shown as follows.

Procedure Gen_Partition // BAP algorithm
input : Counter C, Threshold T, bucket_num
begin
    write targets above T in C to disk;
    reset C;
    bucket_num := 0;
end



Fig. 3. BAP during Phase 1 (average threshold = 1000, number of counter
buckets = 2; panel labels: relation, after reading 3rd tuple)

Example. Fig. 3 shows an example of partitioning in BAP. When the fourth tuple
<doctor, 1200> is read, the first partitioning takes place. Only the bucket for
target {professor} satisfies the threshold, so {professor} is written to disk.
Then all buckets are reset, a new partition begins, and the fourth tuple can be
processed. These steps are repeated until the relation ends.
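
The two phases with BAP's Gen_Partition can be sketched in Python as below.
This is a minimal in-memory illustration, not the authors' C++ implementation:
the function name, the use of a set in place of the candidates written to disk,
and the sample tuples are assumptions. Two details are also assumptions that the
pseudocode leaves implicit: the tuple that triggered partitioning is inserted
into the freshly reset counter, and the buckets of the final partition are
checked for candidates as well.

```python
def bap(relation, threshold, num_buckets):
    """Sketch of BAP. `relation` is a list of (target, rest) pairs that can
    be scanned repeatedly; `num_buckets` plays the role of |C|."""
    counter = {}          # target -> [sum_val, cnt_val]
    candidates = set()    # stands in for the candidates written to disk
    second_scan = False

    def select(c):
        return {t for t, (s, n) in c.items() if s / n >= threshold}

    # Phase 1: partition logically and select candidates.
    for target, rest in relation:
        if target not in counter and len(counter) >= num_buckets:
            candidates |= select(counter)   # Gen_Partition (BAP)
            counter.clear()
            second_scan = True
        bucket = counter.setdefault(target, [0, 0])
        bucket[0] += rest
        bucket[1] += 1

    # Phase 2: compute the exact averages of the candidates.
    if not second_scan:
        return {t: s / n for t, (s, n) in counter.items() if s / n >= threshold}
    candidates |= select(counter)           # assumption: last partition too
    results = {}
    pending = sorted(candidates)
    for start in range(0, len(pending), num_buckets):
        batch = {t: [0, 0] for t in pending[start:start + num_buckets]}
        for target, rest in relation:       # one scan of R per batch
            if target in batch:
                batch[target][0] += rest
                batch[target][1] += 1
        results.update({t: s / n for t, (s, n) in batch.items()
                        if s / n >= threshold})
    return results

# Hypothetical tuples; only the fourth tuple <doctor, 1200> comes from the text.
relation = [("professor", 1500), ("manager", 800), ("professor", 900),
            ("doctor", 1200), ("manager", 1300), ("doctor", 700)]
result = bap(relation, threshold=1000, num_buckets=2)
```

With two buckets, reading the fourth tuple triggers partitioning and only
{professor} is kept as a candidate at that point, which matches the example.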

4.3 Algorithm POP 

Procedure Gen_Partition. The idea of POP is to put off partitioning. When
an empty bucket is needed but does not exist, all buckets below the threshold
are reset instead of writing candidates at once. This frees buckets, so new
tuples can continue to be read. If no bucket can be reset, that is, the targets
in all buckets are candidates, then partitioning happens. The details are shown
as follows.

Procedure Gen_Partition // POP algorithm
input : Counter C, Threshold T, bucket_num
begin
    reset buckets below T in C;
    if there is no bucket to reset then
        write all targets in C to disk;
        reset C;
        bucket_num := 0;
    end if
end



Example. Fig. 4 shows an example of partitioning in POP. When the fourth tuple
is read, neither the bucket for {doctor} nor an empty bucket exists. So the
{manager} bucket, which is below the threshold, is reset. Now it is possible to
process the fourth tuple. Recall that in BAP a new partition began at this
point. When the eighth tuple is read, the bucket for {manager} does not exist
in the counter, there is no empty bucket, and all targets in the counter meet
the threshold. Then partitioning happens and all targets are written to disk.
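
Phase 1 with POP's Gen_Partition can be sketched as follows. As before, this is
an assumption-laden illustration: the function name, the in-memory candidate
set, and the eight sample tuples are hypothetical (chosen so that the fourth and
eighth tuples behave as in the example), and the buckets left at the end of the
scan are checked for candidates too.

```python
def pop_phase1(relation, threshold, num_buckets):
    """Sketch of phase 1 of POP. Returns the candidate set and the number
    of times all buckets had to be written out (actual partitionings)."""
    counter = {}          # target -> [sum_val, cnt_val]
    candidates = set()    # stands in for the candidates written to disk
    flushes = 0
    for target, rest in relation:
        if target not in counter and len(counter) >= num_buckets:
            below = [t for t, (s, n) in counter.items() if s / n < threshold]
            if below:
                # Postpone partitioning: only reset the non-candidates.
                for t in below:
                    del counter[t]
            else:
                # Every bucket holds a candidate, so partition now.
                candidates |= set(counter)
                counter.clear()
                flushes += 1
        bucket = counter.setdefault(target, [0, 0])
        bucket[0] += rest
        bucket[1] += 1
    # Assumption: buckets remaining after the scan are checked as well.
    candidates |= {t for t, (s, n) in counter.items() if s / n >= threshold}
    return candidates, flushes

relation = [("professor", 1500), ("manager", 800), ("professor", 900),
            ("doctor", 1200), ("professor", 1100), ("doctor", 1000),
            ("doctor", 900), ("manager", 1300)]
cands, flushes = pop_phase1(relation, threshold=1000, num_buckets=2)
```

On this input the {manager} bucket is reset when the fourth tuple arrives, and
the only partitioning happens at the eighth tuple, when both {professor} and
{doctor} meet the threshold.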

4.4 Analysis 

We now analyze BAP and POP. Postponing partitioning in POP affects the number
of partitions and the domain size of each partition. First, the number of
partitions in POP never becomes larger than that in BAP. As the number of
partitions becomes smaller, the probability that peak data appear decreases.
Peak data are targets that are below the threshold in the relation but meet the
threshold in some partition. The existence of peak data increases the number of
candidates. The worst case of POP is when each partition is occupied only by
peak data whose cnt_val is one. Second, while the domain size of each partition
except the last is the same in BAP, it is not in POP, because non-candidates
are reset and new tuples can be processed while partitioning is put off.




Partitioning Algorithms for the Computation of Average Iceberg Queries 283 



Fig. 4. POP during Phase 1 (average threshold = 1000, number of counter
buckets = 2; panel labels: relation, after reading 3rd tuple, reset
non-candidates)



5 Experiments 

Our experiments were run on an UltraSPARC-IIi 333 MHz machine running Solaris 7,
with 512 MB of RAM and 18 GB of local disk space. We implemented the two
algorithms in C++ and generated four datasets. To evaluate their relative
efficiency, we varied the memory size, data order, and threshold over the four
datasets.



5.1 Datasets 

All generated datasets consist of 100,000,000 tuples. These datasets are about
1.3 GB. The characteristics of each dataset depend on the target distribution,
the rest distribution, or the data order. The datasets are described as follows.

Dataset 1. The occurrences of targets follow a normal distribution, and the
values of rest follow a uniform distribution. The domain size is 226,280. The
maximum average value is about 999,000, and the minimum is about 0.

Dataset 2. This was generated by sorting dataset 1 by target.

Dataset 3. The occurrences of targets follow a uniform distribution, and the
values of rest follow a normal distribution. The domain size is 1,000,031. The
maximum average value is about 21,000, and the minimum is about -19,000.

Dataset 4. This was generated by sorting dataset 3 by target.



5.2 Memory Size 

We define the domain ratio as the ratio of the number of counter buckets to the
domain size. A domain ratio of 1.0 or more means that there are enough counter
buckets.

$$\text{domain ratio} = \frac{\text{the number of counter buckets}}{\text{domain size}}$$

The first experiments compared BAP with POP as we varied the domain ratio
from 0.1 to 1.0. The results are shown in Fig. 5 and Fig. 6. In Fig. 5(a) and
Fig. 6(a), it took about 2,000 seconds to scan the relation once. When the
domain ratio was between 0.4 or 0.5 and 0.9, average iceberg queries (AIQ) were
performed within two or three scans. POP was faster than BAP, and POP produced
fewer candidates. When the domain ratio was lower than 0.4, both BAP and POP in
effect only served to find all unique data in phase 1.

Fig. 5. Execution time and Candidates in Dataset 1

Fig. 6. Execution time and Candidates in Dataset 3 (panels: (a) Execution time,
(b) Candidates; Dataset 3, Threshold = 14000; x-axis: Domain Ratio)

Fig. 7. Execution time vs. Data Order (panels: Dataset 2, Threshold = 700000;
Dataset 4, Threshold = 14000; x-axis: Domain Ratio)

Fig. 8. Execution time varying on Threshold (panels: Dataset 1, Domain
Ratio = 0.5; Dataset 3, Domain Ratio = 0.5)

The distribution of targets results in a few differences between Fig. 5(a) and
Fig. 6(a). The tuples with the same target were scattered more in dataset 3
than in dataset 1 because the occurrences of targets follow a random
distribution in dataset 3. Hence, peak data occurred more often. The opposite
case is shown in Section 5.3.



5.3 Data Order 

The second experiments examined the effects of data order. The results are
shown in Fig. 7. Each plot seems to contain one line, but in fact two lines
overlap. Regardless of memory size, AIQ were always performed by scanning the
relation only twice. BAP and POP performed almost the same. Compared with
Fig. 5(a) and Fig. 6(a), these results are excellent even though each dataset
is the same except for data order. This results from the exact selection of
candidates, because the tuples of each target are gathered together in
datasets 2 and 4.



5.4 Threshold 

The last experiments examined the relation between threshold and execution
time. The results are shown in Fig. 8. When the threshold was high in both
Dataset 1 and Dataset 3, POP needed to scan the relation only twice. As the
threshold became lower, more scans were needed.

6 Conclusion

Until now, average functions have been left out of consideration in the iceberg
query problem. In this paper, we introduced the theorem for selecting
candidates, and presented the BAP (Basic Partitioning) algorithm and the POP
(Postponed Partitioning) algorithm based on this theorem to compute average
iceberg queries efficiently. These algorithms use the strategy of partitioning
a relation logically. Moreover, the POP algorithm uses the technique of
postponing partitioning.

Our experiments demonstrated that the partitioning algorithms were affected
most by data order and memory size. If datasets are sorted, the performance
is excellent regardless of memory size. Otherwise, memory size affects the
performance very much. The comparison between the two algorithms showed that
POP was faster than BAP and produced fewer candidates when the domain ratio was
between 0.4 and 0.9, and that they were almost the same when the domain ratio
was low.

A data warehouse [2, 4, 5, 6] is an integrated repository derived from multiple
source (operational and legacy) databases. The data warehouse is created by
either replicating the different source data or transforming them to a new
representation. This process involves reading, cleaning, aggregating, and
storing the data in the warehouse model. Software tools are used to access the
warehouse for strategic analysis, decision-making, and marketing types of
applications. It can be used for inventory control of shelf stock in many
department stores.

Medical and human genome researchers can create research data that can be either
marketed or used by a wide range of users. The information and access privileges
in a data warehouse should mimic the constraints of the source data. A recent
trend is to create web-based data warehouses in which multiple users can create
components of the warehouse and keep an environment that is open to third-party
access and tools. Given the opportunity, users ask for lots of data in great
detail. Since source data can be expensive, its privacy and security must be
assured. The idea of adaptive querying can be used to limit access after some
data has been offered to the user. Based on the user profile, access to
warehouse data can be restricted or modified.

In this talk, I will focus on the following ideas that can contribute towards warehouse 
security. 



1. Replication control 

Replication can be viewed in a slightly different manner than in the traditional
literature. For example, an old copy can be considered a replica of the current
copy of the data. Slightly out-of-date data can be considered a good substitute
for some users. The basic idea is that either the warehouse keeps different
replicas of the same



* This research is partially supported by CERIAS and NSF under grant numbers 
9805693-EIA, CCR-990I712. 

Y. Kambayashi, M. Mohania, and A M. Tjoa (Eds.): DaWaK 2000, LNCS 1874, pp. 287-289, 2000. 

© Springer-Verlag Berlin Heidelberg 2000 




288 B. Bhargava 



items or creates them dynamically. Legitimate users get the most consistent and
complete copy of the data, while casual users get a weak replica. Such a replica
may be enough to satisfy the user's need but does not provide information that
can be used maliciously or to breach privacy. We have formally defined the
equivalence of replicas [7], and this notion can be used to create replicas for
different users. The replicas may be at one central site or can be distributed
to proxies who may serve the users efficiently. In some cases the user may be
given the weak replica and later an upgraded replica if willing to pay for it
or entitled to it. Another related idea is that of a witness, which was
discussed in [1].

2. Aggregation and Generalization 

The concept of a warehouse is based on the idea of using summaries and
consolidators. This implies that source data is not available in raw form, which
leads to ideas that can be used for security. Some users can get aggregates only
over a large number of records, whereas others can be given aggregates over
small data instances. The granularity of aggregation can be lowered for genuine
users. Generalization can be used to give users high-level information at first,
with lower-level details given after the security constraints are satisfied. For
example, the user may initially be given an approximate answer based on some
generalization over the domains of the database. Inheritance is another notion
that allows increasing the access capability of users: users can inherit access
to related data after having access to some data item.

3. Exaggeration and Misleading 

These concepts can be used to mutilate the data. A view may be available to
support a particular query, but the values may be overstated in the view. For
security reasons, the quality of views may depend on the user involved, and the
user can be given an exaggerated view of the data. For example, instead of
giving specific sales figures, views may scale them up and give only exaggerated
data. In certain situations warehouse data can give some misleading information
[3], that is, information that may be partially incorrect or whose correctness
is difficult to verify. For example, a view of a company's annual report may
contain a net profit figure that includes the profit from sales of properties
(not the actual sales of products).

4. Anonymity 

Anonymity provides privacy for both the user and the warehouse data. A user does
not know the source warehouse for his query, and the warehouse does not know who
the user is or what particular view the user is accessing (a view may be
constructed from many source databases for that warehouse). Note that a user
must belong to the group of registered users and, similarly, a user must get
data only from legitimate warehouses. In such cases, encryption and middleware
are used to secure the connection between the users and the warehouse so that no
outside user (a user who has not registered with the warehouse) can access the
warehouse.

5. User Profile Based Security

A user profile is a representation of the preferences of an individual user.
User profiles can help in authentication and in determining the levels of
security for accessing warehouse data. A user profile must describe how and what
has to be represented pertaining to the user's information and security level
authorization needs. The growth in warehouses has made access to relevant
information in reasonable time difficult, because the large number of sources
differ in terms of context and representation. A warehouse can use data category
details in determining access control. For example, if a new employee would like
to access an unpublished annual company report, the warehouse server may deny
access to it. The alternative is to construct and return a view to her that
reflects only projected sales and profit. Such a construction of the view may be
transparent to the user. A server can also use the profile to decide whether the
user should be given access to graphical images associated with the data. The
server has the option to reduce the resolution or the quality of images before
making them available to users.

Abstract. In this paper, we propose a new algorithm named Inverted
Hashing and Pruning (IHP) for mining association rules between words
in text databases. The characteristics of text databases are quite
different from those of retail transaction databases, and existing mining
algorithms cannot handle text databases efficiently because of the large
number of itemsets (i.e., words) that need to be counted. Two well-known
mining algorithms, the Apriori algorithm [1] and the Direct Hashing
and Pruning (DHP) algorithm [5], are evaluated in the context of mining
text databases and are compared with the proposed IHP algorithm. It
is shown that the IHP algorithm has better performance for large
text databases.



1 Introduction 

Mining association rules in transaction databases has been demonstrated to be
useful and technically feasible in several application areas [2,3], particularly
in retail sales. Let $\mathcal{I} = \{i_1, i_2, \dots, i_m\}$ be a set of items.
Let $\mathcal{D}$ be a set of transactions, where each transaction $T$ is a set
of items such that $T \subseteq \mathcal{I}$. An association rule is an
implication of the form $X \Rightarrow Y$, where $X \subset \mathcal{I}$,
$Y \subset \mathcal{I}$, and $X \cap Y = \emptyset$. The association rule
$X \Rightarrow Y$ holds in the database $\mathcal{D}$ with confidence c if c% of
transactions in $\mathcal{D}$ that contain X also contain Y. The rule
$X \Rightarrow Y$ has support s if s% of transactions in $\mathcal{D}$ contain
$X \cup Y$. Mining association rules is to find all association rules that have
support and confidence greater than the user-specified minimum support (called
minsup) and minimum confidence (called minconf) [1]. For example, beer and
disposable diapers are items such that beer $\Rightarrow$ diapers is an
association rule mined from the database if the co-occurrence rate of beer and
disposable diapers (in the same transaction) is higher than minsup and the
occurrence rate of diapers in the transactions containing beer is higher than
minconf.
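
The two measures can be computed directly from their definitions. The sketch
below is illustrative only: the function names and the four sample transactions
are made up, and support and confidence are returned as fractions rather than
percentages.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Fraction of the transactions containing x that also contain y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Hypothetical point-of-sale transactions.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "milk"},
    {"milk", "bread"},
]

sup = support({"beer", "diapers"}, transactions)        # 2 of 4 transactions
conf = confidence({"beer"}, {"diapers"}, transactions)  # 2 of the 3 with beer
```

With minsup = 0.4 and minconf = 0.6, for instance, the rule beer ⇒ diapers
would be reported for this toy database.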

The first step in the discovery of association rules is to find each set of
items (called itemset) that has a co-occurrence rate above the minimum support.
An itemset with at least the minimum support is called a large itemset or a
frequent itemset. In this paper, the term frequent itemset will be used. The size
of an itemset represents the number of items contained in the itemset, and
an itemset containing k items will be called a k-itemset. For example, {beer,
disposable diapers} can be a frequent 2-itemset. Finding all frequent itemsets is a
very resource-consuming task and has received a considerable amount of research
effort in recent years. The second step of forming the association rules from the
frequent itemsets is straightforward, as described in [1]: For every frequent
itemset l, find all non-empty subsets of l. For every such subset a, generate a
rule of the form $a \Rightarrow (l - a)$ if the ratio of support(l) to
support(a) is at least minconf.
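
The second step can be sketched as follows, under the assumption that the
support counts of all frequent itemsets are available in a dictionary (the
function name and the sample counts are hypothetical). The confidence of a rule
a ⇒ (l − a) is computed as support(l) / support(a), since a and l − a together
make up l.

```python
from itertools import combinations

def generate_rules(support_count, minconf):
    """support_count maps each frequent itemset (a frozenset) to its support
    count. Returns (antecedent, consequent, confidence) triples."""
    rules = []
    for l, sup_l in support_count.items():
        # Enumerate the non-empty proper subsets a of l.
        for size in range(1, len(l)):
            for a in map(frozenset, combinations(sorted(l), size)):
                conf = sup_l / support_count[a]
                if conf >= minconf:
                    rules.append((a, l - a, conf))
    return rules

# Hypothetical support counts for the frequent itemsets of a small database.
support_count = {
    frozenset({"beer"}): 3,
    frozenset({"diapers"}): 2,
    frozenset({"beer", "diapers"}): 2,
}
rules = generate_rules(support_count, minconf=0.7)
```

The lookup `support_count[a]` relies on the fact that every subset of a frequent
itemset is itself frequent, so it is present in the dictionary.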

The association rules mined from point-of-sale (POS) transaction databases
can be used to predict the purchase behavior of customers. In the case of text
databases, there are several uses of mined association rules. The association
rules for text can be used for building a statistical thesaurus. Consider the case
that we have an association rule $B \Rightarrow C$, where B and C are words. A
search for documents containing C can be expanded by including B. This expansion
will allow for finding documents using C that do not contain C as a term. A
closely related use is Latent Semantic Indexing, where documents are considered
"close" to each other if they share a sufficient number of associations [4]. Latent
Semantic Indexing can be used to retrieve documents that do not have any terms
in common with the original text search expression by adding documents to the
query result set that are "close" to the documents in the original query result
set.

The word frequency distribution of a text database can be very different from 
the item frequency distribution of a sales transaction database. Additionally, the 
number of unique words in a text database is significantly larger than the number 
of unique items in a transaction database. Finally, the number of unique words 
in a typical document is much larger than the number of unique items in a 
transaction. These differences make existing algorithms, such as Apriori
[1] and Direct Hashing and Pruning (DHP) [5], ineffective in mining association
rules in text databases.

A new algorithm suitable for mining association rules in text databases is
proposed in this paper. The algorithm is named Inverted Hashing and Pruning
(IHP) and is described in Section 3. The results of the performance analysis
are discussed in Section 4. The new algorithm demonstrated significantly better
performance than the Apriori and DHP algorithms for large text databases.

2 Text Databases 

Traditional domains for finding frequent itemsets, and subsequently the associ- 
ation rules, include retail point-of-sale (POS) transaction database and catalog 
order database [2]. The number of items in a typical POS transaction is well 
under a hundred. The mean number of items and distribution varies consider- 
ably depending upon the retail operation. In the referenced papers that provided 
experimental results, the number of items per transaction ranged from 5 to 20. 

The word distribution characteristics of text data present some scale
challenges to the algorithms that are typically used in mining transaction
databases. A sample of text documents was drawn from the 1996 TREC [9] data
collection. The sample consisted of the April 1990 Wall Street Journal articles
that were in the TREC collection. There were 3,568 articles and 47,189 unique
words. Most of those words occur in only a few of the documents. Some of the key
distribution statistics are:

48.2% of the words occur in only one document; 

13.4% of the words occur in only two documents; 

7.0% of the words occur in only 3 documents. 

The mean number of unique words in a document was 207, with a standard 
deviation of 174.2 words. In this sample, only 6.8% of the words occurred in 
more than 1% of the documents. A single day sample of January 21, 1990 was 
taken as well. In that sample, there were 9,830 unique words, and 78.3% of the 
words occurred in three or fewer documents. 

The characteristics of this word distribution have profound implications for
the efficiency of association rule mining algorithms. The most important
implications are: (1) the large number of items and combinations of items that
need to be counted; and (2) the large number of items in each document in the
database.

It is commonly recognized in the information retrieval community that words 
that appear uniformly in a text database have little value in differentiating docu- 
ments, and further those words occur in a substantial number of documents [6]. It 
is reasonable to expect that frequent itemsets composed of highly frequent words 
(typically considered to be words with occurrence rates above 20%) would also 
have little value. Therefore, text database miners will need to work with item- 
sets composed of words that are not too frequent, but are frequent enough. The 
range of minimum and maximum support suitable for word association mining 
is not known at this time. 

The experiments described in this paper were conducted as part of a study 
for building a thesaurus from association rules. The thesaurus study was inter- 
ested in discovering associations between words in fairly close proximity in the 
text. The documents were broken into fragments of about 200 words each. The 
word frequency distribution of the document fragments was similar to the fre- 
quency distribution reported above, though not as skewed. Some key distribution 
statistics are: 

29.3% of the words occur in only one fragment; 

21.6% of the words occur in only two fragments; 

9.7% of the words occur in only 3 fragments. 

The mean number of unique words in a document fragment was 101, with a 
standard deviation of 46.4 words. In this sample, only 3.5% of the words occurred 
in more than 1% of the document fragments. 

3 Inverted Hashing and Pruning (IHP) Algorithm 

Inverted Hashing and Pruning (IHP) is analogous to Direct Hashing and
Pruning (DHP) [5] in the sense that both use hash tables to prune some of the
candidate itemsets. In DHP, during the k-th pass on the database, every
(k+1)-itemset within each transaction is hashed into a hash table, and if the
count of the hash value of a (k+1)-itemset is less than the minimum support, it
is not considered as a candidate in the next pass. In IHP, for each item, the
transaction identifiers (TIDs) of the transactions that contain the item are
hashed to a hash table associated with the item, which is named the TID Hash
Table (THT) of the item. When an item occurs in a transaction, the TID of this
transaction is hashed to an entry of the THT of the item, and the entry stores
the number of transactions whose TIDs are hashed to that entry. Thus, the THT of
each item can be generated as we count the occurrences of each item during the
first pass on the database. After the first pass, we can remove the THTs of the
items which are not contained in the set of frequent 1-itemsets, $F_1$, and the
THTs of the frequent 1-itemsets can be used to prune some of the candidate
2-itemsets. In general, after each pass $k \ge 1$, we can remove the THT of each
item that is not a member of any frequent k-itemset, and the remaining THTs can
prune some of the candidate (k+1)-itemsets.

Consider a transaction database with seven items; A, B, C, D, E, F, and 
G. Figure 1 shows the THTs of these items at the end of the first pass. In our 
example, each THT has five entries for illustration purpose. Here we can see that 
item D occurred in five transactions. There were two TIDs hashed to 0, one TID 
hashed to 1, and two TIDs hashed to 4. 
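
The first pass can be sketched as follows. The function name, the sample
database, and in particular the hash function h(TID) = TID mod (number of
entries) are assumptions made for illustration; the text does not specify h.
The TIDs of item D are chosen so that its THT matches the description above.

```python
from collections import defaultdict

def first_pass(database, num_entries):
    """database: iterable of (tid, items) pairs. Returns the occurrence
    count and the TID Hash Table (THT) of every item."""
    count = defaultdict(int)
    tht = defaultdict(lambda: [0] * num_entries)
    for tid, items in database:
        for x in set(items):
            count[x] += 1
            tht[x][tid % num_entries] += 1   # assumed hash: TID mod entries
    return count, tht

# Hypothetical transactions; D occurs in TIDs 1, 4, 5, 9 and 10.
database = [
    (1, {"A", "D"}),
    (2, {"A"}),
    (4, {"D"}),
    (5, {"B", "D"}),
    (9, {"D"}),
    (10, {"D"}),
]
count, tht = first_pass(database, num_entries=5)
```

Here `tht["D"]` is [2, 1, 0, 0, 2]: two TIDs hash to entry 0, one to entry 1,
and two to entry 4, for a total of five occurrences.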



Fig. 1. TID Hash Tables at the end of the first pass (one five-entry THT for
each of the items A, B, C, D, E, F, and G)



If the minimum support count is 7, we can remove the THTs of the items B,
D, E, and F as shown in Figure 2. Only the items A, C, and G are frequent and
are used to determine $C_2$, the set of candidate 2-itemsets. Based on the
Apriori algorithm, {A,C}, {A,G}, and {C,G} are generated as candidate 2-itemsets
by pairing the frequent 1-itemsets. However, in IHP, we can eliminate {A,G} from
consideration by using the THTs of A and G. Item A occurs in 12 transactions,
and item G occurs in 19 transactions. However, according to their THTs, they
can co-occur in at most 6 transactions. Item G occurs in 5 transactions whose
TIDs are hashed to 0, and item A occurs in no transactions that have
such TIDs. Thus, none of those 5 transactions that contain G also contains A.
Item A occurs in 3 transactions whose TIDs are hashed to 1 and item G occurs
in 6 transactions with those TIDs. So, in the set of transactions whose TIDs are











294 J.D. Holt and S.M. Chung 



TID Hash Tables

    Entry    A    C    G
    0        0    0    5
    1        3    0    6
    2        4    5    3
    3        0    5    5
    4        5    3    0

Fig. 2. TID Hash Tables of frequent 1-itemsets (the THTs of items B, D, E, and
F have been removed)



hashed to 1, items A and G can co-occur at most 3 times. The other THT entries 
corresponding to the TID hash values of 2, 3, and 4 can be examined similarly, 
and we can determine items A and G can co-occur in at most 6 transactions, 
which is below the minimum support level. 

In general, for a candidate k-itemset, we can estimate its maximum possible
support count by adding the minimum count of the k items at each entry of 
their THTs. If the maximum possible support count is less than the required 
minimum support, it is pruned from the candidate set. 

The details of the IHP algorithm are presented below: 

1)  Database = set of transactions;
2)  Items = set of items;
3)  transaction = ⟨TID, {x | x ∈ Items}⟩;
4)  Comment: F_1 is a set of frequent 1-itemsets
5)  F_1 = ∅;
6)  Comment: Read the transactions and count the occurrences of each item
             and create a TID Hash Table (THT) for each item using
             a hash function h
7)  foreach transaction t ∈ Database do begin
8)      foreach item x in t do begin
9)          x.count++;
10)         x.THT[h(t.TID)]++;
11)     end
12) end
13) Comment: Form the set of frequent 1-itemsets
14) foreach item i ∈ Items do
15)     if i.count / |Database| ≥ minsup
16)     then F_1 = F_1 ∪ i;
17) Comment: Find F_k, the set of frequent k-itemsets, where k ≥ 2
18) for (k = 2; F_{k-1} ≠ ∅; k++) do begin
19)     foreach item i ∉ any frequent (k - 1)-itemset do
20)         remove i.THT;
21)     Comment: C_k is the set of candidate k-itemsets
22)     C_k = ∅;
23)     tempC_k = ∅;
24)     Comment: F_{k-1} * F_{k-1} is a natural join of
                 F_{k-1} and F_{k-1} on the first k - 2 items
25)     foreach x ∈ {F_{k-1} * F_{k-1}} do
26)         if ¬∃y | y = (k - 1)-subset of x ∧ y ∉ F_{k-1}
27)         then tempC_k = tempC_k ∪ x;
28)     Comment: Prune the candidate k-itemsets using the THTs
29)     foreach x ∈ tempC_k do
30)         if GetMaxPossibleCount(x) / |Database| ≥ minsup
31)         then C_k = C_k ∪ x;
32)     Comment: Scan the transactions to count candidate k-itemsets
33)     foreach transaction t ∈ Database do begin
34)         foreach k-itemset x in t do
35)             if x ∈ C_k then x.count++;
36)     end
37)     Comment: F_k is the set of frequent k-itemsets
38)     F_k = ∅;
39)     foreach x ∈ C_k do
40)         if x.count / |Database| ≥ minsup
41)         then F_k = F_k ∪ x;
42) end
43) Answer = ∪_k F_k;
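
The listing above can be sketched in Python as follows. This is a minimal illustration rather than the authors' Java implementation: the function name ihp, the use of TID modulo the table size as the hash function h, and the use of an absolute minimum support count instead of a fraction are all assumptions made for the example.

```python
from itertools import combinations


def ihp(transactions, min_count, tht_size=5):
    """Sketch of IHP. transactions maps an integer TID to a set of items.

    Returns a dict mapping each frequent itemset (frozenset) to its count.
    Assumption: h(TID) = TID % tht_size, and min_count is an absolute count.
    """
    counts, tht = {}, {}
    # Lines 7-12: count items and fill one TID hash table per item.
    for tid, items in transactions.items():
        slot = tid % tht_size
        for x in items:
            counts[x] = counts.get(x, 0) + 1
            tht.setdefault(x, [0] * tht_size)[slot] += 1
    # Lines 14-16: frequent 1-itemsets.
    frequent = {frozenset([x]): c for x, c in counts.items() if c >= min_count}
    answer = dict(frequent)
    k = 2
    while frequent:
        # Lines 19-20: drop THTs of items in no frequent (k-1)-itemset.
        alive = set().union(*frequent)
        tht = {x: table for x, table in tht.items() if x in alive}
        # Lines 25-27: join on the first k-2 items, then subset check.
        prev = sorted(tuple(sorted(s)) for s in frequent)
        candidates = set()
        for a, b in combinations(prev, 2):
            if a[:-1] != b[:-1]:
                continue
            x = frozenset(a) | frozenset(b)
            if not all(frozenset(y) in frequent for y in combinations(x, k - 1)):
                continue
            # Lines 29-31: prune with the THT upper bound.
            bound = sum(min(tht[i][j] for i in x) for j in range(tht_size))
            if bound >= min_count:
                candidates.add(x)
        # Lines 33-36: scan the transactions and count the candidates.
        cand_counts = dict.fromkeys(candidates, 0)
        for items in transactions.values():
            for x in candidates:
                if x <= items:
                    cand_counts[x] += 1
        # Lines 38-41: keep the frequent k-itemsets.
        frequent = {x: c for x, c in cand_counts.items() if c >= min_count}
        answer.update(frequent)
        k += 1
    return answer


txns = {1: {"A", "B", "C"}, 2: {"A", "B"}, 3: {"A", "C"},
        4: {"B", "D"}, 5: {"A", "B", "C"}}
result = ihp(txns, min_count=2)
```

The candidate counting here tests every candidate against every transaction, which is simpler but slower than the hash tree described in the paper.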



The formation of the set of candidate itemsets can be done effectively when
the items in each itemset are stored in a lexical order, and itemsets are also
lexically ordered. As specified in line 25, candidate k-itemsets, for k ≥ 2, are
obtained by performing the natural join operation F_{k-1} * F_{k-1} on the first
k - 2 items of the (k - 1)-itemsets in F_{k-1}, assuming that the items are
lexically ordered in each itemset [1]. For example, if F_2 includes {A, B} and
{A, C}, then {A, B, C} is a potential candidate 3-itemset. Then the potential
candidate k-itemsets are pruned in line 26 by using the property that all the
(k - 1)-subsets of a frequent k-itemset should be frequent (k - 1)-itemsets [1].
Thus, for {A, B, C} to be a candidate 3-itemset, {B, C} also should be a
frequent 2-itemset. To count the occurrences of the candidate itemsets
efficiently as the transactions are scanned, they can be stored in a hash tree,
where the hash value of each item occupies a level in the tree [1].

GetMaxPossibleCount(x) returns the maximum number of transactions
that may contain k-itemset x by using the THTs of the k items in x. Let's
denote the k items in x by x[1], x[2], ..., x[k]. Then GetMaxPossibleCount(x)
can be defined as follows:

GetMaxPossibleCount(itemset x)
begin
    k = size(x);
    MaxPossibleCount = 0;
    for (j = 0; j < size(THT); j++) do
        MaxPossibleCount += min(x[1].THT[j], x[2].THT[j], ..., x[k].THT[j]);
    return (MaxPossibleCount)
end

size(x) represents the number of items in an itemset x and size(THT) represents
the number of entries in the THT.
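
As a small illustration, the same bound can be written in Python. The function name and the two tables below are assumptions for the sketch: the tables are illustrative values chosen to agree with the counts quoted in the running example (A in 12 transactions, G in 19, at most 6 co-occurrences).

```python
def get_max_possible_count(thts):
    """Upper bound on the number of transactions containing all the items.

    thts holds one TID hash table (a list of per-entry counts) per item
    of the itemset; all tables must have the same number of entries.
    """
    # For each entry j, at most min over the items of THT[j] transactions
    # hashed to j can contain every item of the itemset.
    return sum(min(entry_counts) for entry_counts in zip(*thts))


# Illustrative five-entry THTs for items A and G.
tht_a = [0, 3, 4, 0, 5]   # A occurs in 12 transactions
tht_g = [5, 6, 3, 5, 0]   # G occurs in 19 transactions

bound = get_max_possible_count([tht_a, tht_g])
```

With a minimum support count of 7, the bound of 6 is enough to discard {A,G} without counting it in the database.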

The Partition algorithm [7] used the list of TIDs of the transactions contain- 
ing each item, called a TID-list. Once the TID-list is generated for each item by 
scanning the database once, we don’t need to scan the database again, because 
the support count of any candidate itemset can be obtained by intersecting the 
TID-lists of the individual items in the candidate itemset. However, a major 
problem is that the size of the TID-lists would be too large to maintain when 
the number of transactions in the database is very large. On the other hand, the 
number of entries in the THTs can be limited to a certain value, and it has been 
shown that even a small number of entries can effectively prune many candidate 
itemsets. 
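
The trade-off can be made concrete with a short sketch. The TID values, the table size of 5, and the modulo hash are assumptions chosen for illustration; the point is only that intersecting TID-lists gives the exact support, while THTs of fixed size give an upper bound on it.

```python
def support_from_tid_lists(tid_lists):
    """Exact support count: size of the intersection of the TID-lists."""
    return len(set.intersection(*map(set, tid_lists)))


def bound_from_thts(tid_lists, size):
    """Upper bound on the support from fixed-size TID hash tables."""
    tables = []
    for tids in tid_lists:
        table = [0] * size
        for tid in tids:
            table[tid % size] += 1   # assumed hash function: TID modulo size
        tables.append(table)
    return sum(min(column) for column in zip(*tables))


tids_a = [1, 2, 6, 8, 11]
tids_g = [2, 3, 7, 8, 16]
exact = support_from_tid_lists([tids_a, tids_g])
bound = bound_from_thts([tids_a, tids_g], size=5)
```

Here the TID-lists grow with the number of transactions, whereas each THT stays at five counters regardless of database size.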

For further performance improvement, the IHP algorithm can be used
together with the transaction trimming and pruning method proposed as a part
of the DHP algorithm [5]. The concept of the transaction trimming and pruning
is as follows: During the k-th pass on the database, if an item is not a member
of at least k candidate k-itemsets within a transaction, it can be removed from
the transaction for the next pass (transaction trimming), because it cannot be
a member of a candidate (k + 1)-itemset. Moreover, if a transaction doesn't
have at least k + 1 candidate k-itemsets, it can be removed from the database
(transaction pruning), because it cannot have a candidate (k + 1)-itemset.
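
A minimal sketch of these two rules for a single transaction is given below. The function name and the representation of candidates as a set of frozensets are assumptions for the example; DHP itself performs this work while counting candidates in the hash tree.

```python
from itertools import combinations


def trim_transaction(items, candidates, k):
    """Apply transaction pruning and trimming after pass k.

    items: set of items in the transaction.
    candidates: set of frozensets, the candidate k-itemsets.
    Returns the trimmed item set, or None if the transaction is pruned.
    """
    present = [frozenset(c) for c in combinations(sorted(items), k)
               if frozenset(c) in candidates]
    # Transaction pruning: fewer than k + 1 candidate k-itemsets means
    # the transaction cannot contain a candidate (k + 1)-itemset.
    if len(present) < k + 1:
        return None
    # Transaction trimming: keep an item only if it occurs in at least
    # k of the candidate k-itemsets found in this transaction.
    occurrences = {}
    for c in present:
        for item in c:
            occurrences[item] = occurrences.get(item, 0) + 1
    return {item for item in items if occurrences.get(item, 0) >= k}


c2 = {frozenset("AB"), frozenset("AC"), frozenset("BC")}
kept = trim_transaction({"A", "B", "C", "D"}, c2, k=2)
dropped = trim_transaction({"A", "B", "D"}, c2, k=2)
```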

4 Performance Analysis of the IHP algorithm 

Some performance tests have been done with the IHP algorithm. The first
objective was to assess the effect of the TID Hash Table (THT) on the performance.
The second objective was to assess the impact of transaction trimming and
pruning upon performance. To meet these objectives, we studied the performance of
four miners: Apriori, DHP, IHP without transaction trimming and pruning
(IHPwoTTP), and IHP with transaction trimming and pruning (IHPwTTP). All
four of these miners were derived from the same code base. All of the test runs
were made on a 400 MHz Pentium-II machine with 384 Mbytes of memory. All of
the miners were written in Java, and the IBM 1.1.7A JVM was used. The JVM
memory for objects was constrained (via the mx parameter) to 256 Mbytes. The
initial heap size was set to 256 Mbytes (via the ms parameter) to control the
effect of heap growth overhead on performance. For the IHP, the TID hash table
(THT) size was 400 entries, and the hash table size for DHP was 500,000 entries
for the runs comparing the four miners. Two databases of different sizes were
used for the experiments. The minimum support level used in these tests was 1%
of the text fragments, and words with occurrence rates higher than 20% were
treated as stop words and ignored.

The small data collection consists of two weeks of the Wall Street Journal 
with 1,738 documents broken into 5,954 fragments with a file size of 5.7 Mbytes.
There were 1,643 frequent 1-itemsets, 11,641 frequent 2-itemsets, and 2,621 fre- 
quent 3-itemsets. 

In Figure 3, we find that for the small collection of text documents, DHP
and both IHP algorithms outperform the Apriori algorithm. In this test, there 
were only 3 passes on the database. The Apriori algorithm evaluated 1,607,124 
candidate itemsets, the DHP algorithm evaluated 1,359,819 candidate itemsets, 
and the IHP algorithms evaluated 654,049 candidate itemsets. Compared to the 
Apriori, the DHP was able to achieve a 15% reduction in the number of candidate 
itemsets considered, and the IHP was able to achieve nearly a 60% reduction. 




Fig. 3. 3 passes on 5,954 document fragments and 1,643 frequent items (axis: time in seconds)



In Figure 4, we can see a similar performance pattern for a larger data collec- 
tion. This data collection consisted of three months of the Wall Street Journal 
in 34,828 document fragments with a file size of 34.5 Mbytes. There were 1,618 
frequent items, 11,140 frequent 2-itemsets, and 1,925 frequent 3-itemsets. 

The efficiency of pruning the candidate itemsets is depicted in Figure 5. The
Apriori algorithm counted 1,308,153 candidate 2-itemsets in the second pass.
The DHP algorithm, with a hash table of 500,000 entries, counted 1,291,461
candidate 2-itemsets, which is about a 1.3% reduction over Apriori. The IHP
algorithms counted 820,071 candidate 2-itemsets, which is about a 37.3%
reduction over Apriori.
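
The quoted reductions follow directly from the candidate counts reported for the two collections. The short check below recomputes them; the helper name reduction_pct is an assumption for the example.

```python
def reduction_pct(baseline, counted):
    """Percentage reduction in candidates relative to the Apriori baseline."""
    return 100.0 * (baseline - counted) / baseline


# Small collection: candidates evaluated over all three passes.
small = {
    "DHP": reduction_pct(1_607_124, 1_359_819),
    "IHP": reduction_pct(1_607_124, 654_049),
}
# Large collection: candidate 2-itemsets counted in the second pass.
large_pass2 = {
    "DHP": reduction_pct(1_308_153, 1_291_461),
    "IHP": reduction_pct(1_308_153, 820_071),
}
```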

Figure 6 compares the processing times of the four mining algorithms on a
pass by pass basis. The DHP algorithm requires slightly more processing time
in the first and second passes due to the overhead of generating the hash table
during each pass. In the second pass, the two IHP algorithms require
substantially less processing time than the DHP and Apriori algorithms. This is due to
the reduction in the number of candidate 2-itemsets. In the third pass, all four
algorithms counted the same number of candidate 3-itemsets, and as a result the
IHP without the transaction trimming and pruning (IHPwoTTP) takes the same
amount of time as the Apriori. However, the IHP with transaction trimming and
pruning (IHPwTTP) is faster than the DHP. On the third pass, both algorithms
have the same number of candidate itemsets. The difference in performance is
due to the transaction trimming and pruning performed in the second pass.

Fig. 4. 3 passes on 34,828 document fragments and 1,618 frequent items (axis: time in seconds)

Fig. 5. Candidates counted in each pass (series: Apriori, DHP, IHP)

Fig. 6. Comparison of pass times (series: Apriori, DHP, IHPwoTTP, IHPwTTP)

During the second pass, both DHP and IHPwTTP used the transaction
trimming and pruning and rewrote the transactions with the candidate 2-itemsets.
As a result, each transaction had an average of 1,449 candidate 2-itemsets for
DHP, and an average of 1,324 candidate 2-itemsets for IHPwTTP. Thus, in the
third pass, the DHP algorithm had more 3-itemsets to check against the set
of candidate 3-itemsets than IHPwTTP. On the fourth and subsequent passes,
both IHPwTTP and DHP proceed with the same number of candidates and
the same database size, so there is no big difference in their processing times.

We evaluated the effect of the TID hash table (THT) size on the performance
of IHP. Figure 7 shows the total processing time of the IHPwoTTP for three
passes on the larger test database. The zero entry in the THT corresponds to the
case of Apriori, and it is included for comparison. The total processing
time of IHPwoTTP decreases as the size of the THT increases up to 400 entries,
because a larger THT can prune more candidate itemsets. However, as the THT
size increases, we need more memory space to store the THTs, and there are
more THT entries to check in order to determine whether a candidate itemset
can be pruned or not. In our test, when each THT has 500 entries, the total
number of candidate itemsets was reduced further, but the speedup from this
reduction was not sufficient to overcome the increase in processing time required
to handle the additional THT entries. We also tested the sensitivity of the DHP
performance to its hash table size. Doubling the hash table size to 1,000,000
entries resulted in only about a 9% reduction in total processing time for the
same test case.




5 Conclusions 

This study has two main results. First, the study shows that the IHP algorithm 
is more effective for mining text databases than the existing Apriori and DHP 
algorithms. Secondly, the study shows that transaction trimming and pruning is 
an effective technique to use when mining text databases and has a synergistic 
effect when combined with reducing the number of candidate itemsets. 

The IHP algorithms can effectively use available memory in the first pass to
sharply reduce the number of candidate itemsets in the second pass. Since a lot
of TID hash tables are pruned between the first and second passes, most of the
memory used for holding the TID hash tables in the first pass is available to hold
the candidate itemsets for the second and subsequent passes. In large part, this
effect is a result of the distribution of word occurrences discussed in Section 2.
The large number of words with very low occurrence rates results in a situation
that the preponderance of the TID hash tables generated in the first pass are
pruned prior to the initiation of the second pass.

Transaction trimming and pruning reduces the work required to scan the
transactions. This reduction is significant for text database processing, and is
a significant part of the total processing time reduction provided by the DHP
algorithm and the IHP with transaction trimming and pruning algorithm. The
beneficial effect of transaction trimming and pruning for text data mining was
found to be larger than that of the candidate pruning alone.

The candidate pruning provided by the IHP algorithm combines synergisti- 
cally with the improvement from transaction trimming and pruning to provide 
a superior level of performance. 

Abstract. Data mining is becoming increasingly important as the size of
databases grows ever larger and the need to explore hidden rules from the
databases becomes widely recognized. Currently database systems are
dominated by relational databases, and the ability to perform data mining using
standard SQL queries will definitely ease implementation of data mining.
However, the performance of SQL based data mining is known to fall behind
specialized implementations and the expensive mining tools on sale. In this
paper we present an evaluation of SQL based data mining on a commercial
RDBMS (IBM DB2 UDB EEE). We examine some techniques to reduce I/O
cost by using View and Subquery. Those queries can be more than 6 times
faster than the SETM SQL query reported previously. In addition, we have made
a performance evaluation on a parallel database environment and compared the
performance result with a commercial data mining tool (IBM Intelligent Miner).
We show that SQL based data mining can achieve sufficient performance by
the utilization of SQL query customization and database tuning.

Keywords: data mining, parallel SQL, query optimization, commercial RDBMS



1 Introduction 

Extracting valuable rules from a large set of data has attracted lots of attention from
both the research and business communities. This is particularly driven by the
explosion in the amount of information stored in databases such as data warehouses
during recent years. In the business world, many organizations have begun to apply
data mining techniques directly to raw transaction data, and some results such as
unidentified buying patterns and credit card fraud indications are widely recognized.

One method of data mining is finding association rules [1]. Basket data analysis is
typical of this method. Several approaches have been proposed to mine association
rules [1,2,6,9]; some of them are based on relational database standard SQL [3,7,8].
But this kind of mining is known as a CPU power demanding application and it has to
handle very large amounts of transaction data. Unfortunately the SQL approach is
reported to have a drawback in performance, although it has many advantages such as
seamless integration with existing systems and high portability [7].

On the other hand, most major commercial database systems have recently
included capabilities to support parallelization, although no report is available about
how the parallelization affects the performance of the complex queries required by
association rule mining. This fact motivated us to examine how efficiently SQL based
association rule mining can be parallelized and sped up using a commercial parallel
database system (IBM DB2 UDB EEE). We propose two techniques to enhance the
association rule mining query based on SETM [3]. We have also compared the
performance with a commercial mining tool (IBM Intelligent Miner). Our performance
evaluation shows that we can achieve performance comparable to the commercial
mining tool using only 4 nodes.

Some considerable works on effective SQL queries to mine association rules, such
as [7] and [8], didn't examine the effect of parallelization. [4] and [5] have reported a
performance evaluation on a PC cluster as the parallel platform. Comparison with
natively coded programs is also reported. In contrast, we use currently available
commercial products for the evaluation.



2 Association Rule Mining Based on SQL 

An example of association rule mining is finding “if a customer buys A and B then
90% of them buy also C” in transaction databases of large retail organizations. This
90% value is called the confidence of the rule. Another important parameter is the
support of an itemset, such as support({A,B,C}), which is defined as the percentage of
all transactions that contain the itemset. For the above example, confidence can also
be measured as support({A,B,C}) divided by support({A,B}).
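
These two definitions can be illustrated with a few lines of Python. The basket data and the function names are made up for the example, and support is returned as a fraction rather than a percentage.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(body, head, transactions):
    """Confidence of the rule body => head, as a ratio of two supports."""
    both = set(body) | set(head)
    return support(both, transactions) / support(body, transactions)


baskets = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "D"}]
sup_abc = support({"A", "B", "C"}, baskets)      # 2 of 5 baskets
conf_ab_c = confidence({"A", "B"}, {"C"}, baskets)
```

In this toy data, two of the three baskets containing both A and B also contain C, so the rule "A and B implies C" has a confidence of two thirds.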

In our experiments we employed three types of SQL query. The first is the standard
SQL query using the SETM algorithm [3]. The second is the enhanced SQL query using
the view materialization technique. The third is another enhanced SQL query using
the subquery technique.



2.1 SQL query using SETM algorithm 

Transaction data is transformed into the first normal form (transaction ID, item). In
the first pass we simply gather the count of each item. Items that satisfy the minimum
support are inserted into the large itemsets table C_1, which takes the form (item, item
count). SETM employs temporary tables to reuse item combinations in the next pass. In
the first pass, transactions that match large itemsets are preserved in temporary table
R_1.

In other passes, for example pass k, we first generate all lexicographically ordered
candidate itemsets of length k into another temporary table RTMP_k by self-joining
table R_(k-1), which contains transaction data of length k-1. Then we generate the
count for those itemsets; itemsets that meet the minimum support are included in the
large itemset table C_k. Finally, transaction data R_k of length k is generated by
matching items in candidate itemset table RTMP_k with items in the large itemsets. In
order to avoid excessive I/O, we disable the log during the execution.
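
The first two passes described above can be sketched with Python's built-in sqlite3 module. This is an illustration only, not the DB2 queries used in the experiments: the input table name sales, the column names, and the function name are assumptions, while C_1, R_1, RTMP_2 and C_2 follow the naming in the text.

```python
import sqlite3


def setm_two_passes(transactions, min_count):
    """Run SETM passes 1 and 2 on {tid: [items]}; return rows of C_2."""
    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.execute("CREATE TABLE sales (tid INTEGER, item TEXT)")
    cur.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [(tid, item) for tid, items in transactions.items() for item in items],
    )
    # Pass 1: large 1-itemsets C_1 (item, count).
    cur.execute("CREATE TABLE c_1 (item1 TEXT, cnt INTEGER)")
    cur.execute(
        "INSERT INTO c_1 SELECT item, COUNT(*) FROM sales "
        "GROUP BY item HAVING COUNT(*) >= ?", (min_count,))
    # R_1: transaction rows restricted to large items, materialized.
    cur.execute("CREATE TABLE r_1 (tid INTEGER, item1 TEXT)")
    cur.execute(
        "INSERT INTO r_1 SELECT s.tid, s.item FROM sales s "
        "JOIN c_1 c ON s.item = c.item1")
    # Pass 2: RTMP_2 holds lexicographically ordered pairs per transaction.
    cur.execute("CREATE TABLE rtmp_2 (tid INTEGER, item1 TEXT, item2 TEXT)")
    cur.execute(
        "INSERT INTO rtmp_2 SELECT p.tid, p.item1, q.item1 "
        "FROM r_1 p JOIN r_1 q ON p.tid = q.tid AND p.item1 < q.item1")
    # C_2: pairs that meet the minimum support count.
    cur.execute("CREATE TABLE c_2 (item1 TEXT, item2 TEXT, cnt INTEGER)")
    cur.execute(
        "INSERT INTO c_2 SELECT item1, item2, COUNT(*) FROM rtmp_2 "
        "GROUP BY item1, item2 HAVING COUNT(*) >= ?", (min_count,))
    rows = cur.execute(
        "SELECT item1, item2, cnt FROM c_2 ORDER BY item1, item2").fetchall()
    con.close()
    return rows


data = {1: ["A", "B", "C"], 2: ["A", "B"], 3: ["A", "C"], 4: ["B", "D"]}
large_pairs = setm_two_passes(data, min_count=2)
```

Every intermediate result (R_1, RTMP_2) is written to a table here, which is the materialization cost that the enhanced queries in the following subsections aim to avoid.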




SQL Based Association Rule Mining Using Commercial RDBMS 303 



2.2 Enhanced SETM query using view materialize technique 

SETM has to materialize its temporary tables, namely R_k and RTMP_k. Those
temporary tables are only required in the next pass and they are not needed for
generating the rules. In fact, those tables can be deleted after execution of the
subsequent pass. Based on this observation, we can avoid the materialization cost of
those temporary tables by replacing the table creation with views.



2.3 Enhanced SETM query using subquery technique 

We expect significant performance improvement from the use of views. However,
creating a view still requires time to access the system catalog and holds locks on the
system catalog table, so we further use subqueries instead of temporary tables. That
is, we embed the generation of item combinations into the query that generates the
large itemsets.
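
One way such an embedded query could look for the second pass is sketched below, again using Python's sqlite3 module for a runnable illustration. The table name sales, the column names, and the exact shape of the query are assumptions; the paper does not list its DB2 statements.

```python
import sqlite3


def large_pairs_subquery(transactions, min_count):
    """Compute large 2-itemsets in one query, with no temporary tables."""
    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.execute("CREATE TABLE sales (tid INTEGER, item TEXT)")
    cur.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [(tid, item) for tid, items in transactions.items() for item in items],
    )
    # Subquery standing in for C_1: items meeting the minimum support count.
    large_items = ("(SELECT item FROM sales GROUP BY item "
                   "HAVING COUNT(*) >= :n)")
    # The derived tables p and q stand in for R_1; the join stands in for
    # RTMP_2; the outer GROUP BY produces C_2 directly.
    query = f"""
        SELECT p.item, q.item, COUNT(*)
        FROM (SELECT tid, item FROM sales WHERE item IN {large_items}) AS p
        JOIN (SELECT tid, item FROM sales WHERE item IN {large_items}) AS q
          ON p.tid = q.tid AND p.item < q.item
        GROUP BY p.item, q.item
        HAVING COUNT(*) >= :n
        ORDER BY p.item, q.item
    """
    rows = cur.execute(query, {"n": min_count}).fetchall()
    con.close()
    return rows


data = {1: ["A", "B", "C"], 2: ["A", "B"], 3: ["A", "C"], 4: ["B", "D"]}
pairs = large_pairs_subquery(data, min_count=2)
```

Nothing is written between the passes in this form, so the intermediate rows can flow through memory instead of being stored in temporary tables.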



3 Performance Evaluation 

3.1 Parallel Execution Environment 

In our experiment we employed a commercial parallel RDBMS, IBM DB2 UDB EEE
version 6.1, on an IBM UNIX parallel server system, the IBM RS/6000 SP. The system
consists of 12 nodes in a shared-nothing architecture. Each node has a 77 MHz POWER2
CPU, a 4.4 GB SCSI hard disk, and 256 MB of RAM, and the nodes are connected by a
High Performance Switch (HPS) with a network speed of 100 MB/s.

We also used the commercial data mining tool IBM Intelligent Miner on a single node
of the RS/6000 SP for performance comparison with the SQL based data mining.



3.2 Dataset 

We use synthetic transaction data generated with the program described in the Apriori
algorithm paper [2] for the experiment. The parameters used are: number of transactions
200000, average transaction length 10, and number of items 2000. Transaction data is
partitioned uniformly among the processing nodes’ local hard disks by a hashing
algorithm applied to the transaction ID.
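
The partitioning scheme can be pictured with the sketch below. The modulo hash and the function name are assumptions for illustration; DB2 applies its own hashing function to the partitioning key.

```python
def partition_by_tid(rows, n_nodes):
    """Assign (tid, item) rows to nodes by hashing the transaction ID."""
    parts = [[] for _ in range(n_nodes)]
    for tid, item in rows:
        parts[tid % n_nodes].append((tid, item))   # assumed hash: tid mod n
    return parts


rows = [(1, "A"), (1, "B"), (2, "A"), (3, "C"), (4, "B"), (5, "A")]
parts = partition_by_tid(rows, n_nodes=4)
```

Because all rows of one transaction land on the same node, a self-join on the transaction ID can in principle be evaluated locally on each node.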



3.3 Performance comparison 

Figure 1 shows the execution time for each degree of parallelization. On average,
View and Subquery SQL are about 6 times faster than SETM SQL
regardless of the number of nodes. The result is also compared with the execution
time of Intelligent Miner on a single processing node. It is true that Intelligent Miner
on a single node with transaction data stored in a flat file is much faster than the SQL
queries. However, the View and Subquery SQL are 50% faster than Intelligent Miner
on a single node if the transaction data has to be read from the RDBMS. We showed
that, using View and Subquery SQL, we can achieve performance comparable to
Intelligent Miner on a single node with a flat file by activating only 4 nodes. The result
gives evidence for the effectiveness of parallelizing SQL queries to mine
association rules.




Fig. 1. Execution time on parallel database environment 



The speedup ratio is shown in Figure 2. This is also reasonably good; in particular,
View and Subquery SQL do not saturate as the number of processing nodes
increases, which means they can be parallelized well. The execution is 11 times faster
with 12 nodes. In a parallel environment, the network can become a bottleneck that
degrades the speedup ratio. However, our experiments suggest that association rule
mining using variants of SETM is mostly CPU bound and network I/O is negligible.




Fig. 2. Speedup ratio in parallel database environment 

3.4 Analysis of execution trace

In this section, we compare the three variations of SQL query described before. The
performance evaluation is done on 12 nodes.

The mining is two passes long. It is well known that in most cases the second pass
generates a huge number of candidate itemsets, and thus it is the most time consuming
phase [4][5]. Our results are very much alike. About 80% or more of the execution time
belongs to PASS2 in all three SQL queries. View and Subquery SQL
complete their first and second passes faster than the SETM SQL query.

We have recorded the execution traces of the three SQL queries in each pass. The
decomposition of the execution time for PASS2 is shown in Figure 3.
Comparing the elapsed time with the CPU time in Figure 3, we find that both are close
for View SQL and Subquery SQL. This means these queries are CPU bound, while SETM
SQL is not. Most of the execution time of the SETM query is dominated by disk
write time for creating temporary tables such as R_k and RTMP_k. We can also see
that the sort time is almost equal for all three queries, which represents the cost of
group by aggregation.

In PASS2, SETM reuses item combinations in temporary table R_1 on secondary
storage, which is generated in PASS1. We replace it with a view or subquery. Then data
is transferred directly through memory from PASS1 to PASS2. Figure 3 indicates that
PASS2 of those modified SQL queries only reads data from the buffer pool. Thus the
disk write time of View SQL and Subquery SQL is almost negligible, although it is
dominant for SETM SQL. This analysis clarifies the problem of SETM and how the I/O
cost can be reduced for View and Subquery SQL, which is the key to the
performance improvement.

Fig. 3. Decomposition of execution time of PASS2 for three types of SQL queries

4 Summary and Conclusion

The ability to perform data mining using standard SQL queries will benefit data
warehouses through better integration with commercial RDBMSs. It also makes it
easier to port code among different systems.

In this paper, we reported the parallelization of SQL queries to mine association rules
on a commercial RDBMS (IBM DB2 UDB EEE). We showed that a good speedup ratio
can be achieved, which means the queries parallelize well. We also examined two
variations of SETM SQL queries to improve performance, which reduce I/O cost by
using the View materialization or Subquery technique, and can achieve performance
more than 6 times faster than the SETM SQL query with two passes. We expect still
more improvement with more passes, and would like to report on that in future work.

We have compared the parallel implementation of SQL based association rule
mining with a commercial data mining tool (IBM Intelligent Miner). Through a real
implementation, we have shown that our improved SETM query using View or Subquery
can beat the performance of the specialized tool with only 4 nodes, while the original
SETM query needs more than 24 processing nodes to achieve the same performance. We
don’t have to buy an expensive data mining tool, since parallel execution of SQL comes
at no extra cost. It is also extremely easy to implement and flexible.

Many further investigations remain. We’d like to check the performance
in situations with more passes. In addition, we plan to experiment with larger
transaction data. We’d also like to investigate the effect of intra parallelism under an
SMP environment.

Abstract. We investigate ways to support interactive mining sessions,
in the setting of association rule mining. In such sessions, users specify 
conditions (filters) on the associations to be generated. Our approach 
is a combination of the incorporation of filtering conditions inside the 
mining phase, and the filtering of already generated associations. We 
present several concrete algorithms and compare their performance. 



1 Introduction 

The interactive nature of the mining process has been acknowledged from the 
start [3]. It motivated the idea of a “data mining query language” [5-8, 10] and 
was stressed again by Ng, Lakshmanan, Han and Pang [11]. A data mining query 
language allows the user to ask for specific subsets of association rules. Efficiently 
supporting data mining query language environments is a challenging task. 

In this paper, working in the concrete setting of association rule mining, we 
consider a class of conditions on associations to be generated which should be 
expressible in any reasonable data mining query language: Boolean combinations 
of atomic conditions, where an atomic condition can either specify that a certain 
item occurs in the body of the rule or the head of the rule, or set a threshold on 
the support or on the confidence. A mining session then consists of a sequence
of such Boolean combinations (henceforth referred to as queries). 

We present the first algorithm to support interactive mining sessions effi- 
ciently. We measure efficiency in terms of the total number of itemsets that 
are generated, but do not satisfy the query, and the number of scans over the 
database that have to be performed. Specifically, our results are the following: 

1. The filtering achieved by exploiting the query conditions is non-redundant,
in the sense that it never generates an itemset that, apart from the minimal 
support and confidence thresholds, could give rise to a rule that does not 
satisfy the query. Therefore, the number of generated itemsets during the 
execution of a query, becomes proportional to the strength of the conditions 
in the query: the more specific the query, the faster its execution. 

2. Not only is the number of passes through the database reduced, but also 
the size of the database itself, again proportionally to the strength of the 
conditions in the query. 

3. A generated itemset will, within a session, never be regenerated as a
candidate itemset: results of earlier queries are reused when answering a new
query.

The idea that filters can be integrated in the mining algorithm was
initially launched by Srikant, Vu, and Agrawal [12], who considered filters that are
Boolean expressions over the presence or absence of certain items in the rules
(filters specifically as bodies or heads were not discussed). The algorithms
proposed in their paper are not optimal: they generate and test several itemsets
that do not satisfy the filter, and their optimizations also do not always become
more efficient for more specific filters.

Also Lakshmanan, Ng, Han and Pang worked on the integration of
constraints on itemsets in mining, considering conjunctions of conditions such as
those considered here, as well as others (arbitrary Boolean combinations were
not discussed) [9,11]. Of the various strategies for the so-called “CAP”
algorithm they present, the one that can handle the filters considered in the present
paper is their “strategy II.” Again, this strategy generates and tests itemsets
that do not satisfy the filter. Also their algorithms implement a rule-filter by
separately mining for possible heads and for possible bodies, while we tightly
couple filtering of rules with filtering of sets.

Neither work discusses the reuse of results acquired from earlier queries
within a session.

This paper is further organized as follows. We assume familiarity with the 
notions and terminology of association rule mining and the Apriori algorithm 
[1,2]. In Section 2, we present a way of incorporating query-constraints inside 
a frequent set mining algorithm. In Section 3, we discuss ways of supporting 
interactive mining sessions. In Section 4, we present an experimental evaluation 
of our algorithms, and discuss their implementation. 



2 Exploiting Constraints 

As already mentioned in the Introduction, the constraints we consider in this
paper are Boolean combinations of atomic conditions. An atomic condition can
either specify that a certain item i occurs in the body of the rule or the head of
the rule, denoted respectively by Body(i) or Head(i), or set a threshold on the
support or on the confidence.

In this section, we explain how we can incorporate these constraints in the 
mining algorithm. We first consider the special case of constraints where only 
conjunctions of atomic conditions or their negations are allowed. 



2.1 Conjunctive Constraints 

Let b_1, ..., b_ℓ be the items that must be in the body by the constraint;
b'_1, ..., b'_ℓ' those that must not; h_1, ..., h_m those that must be in the
head; and h'_1, ..., h'_m' those that must not.

Recall that an association rule X ⇒ Y is only generated if X ∪ Y is a frequent
set. Hence, we only have to generate those frequent sets that contain every b_i
and h_i, plus some of the subsets of these frequent sets that can serve as bodies
or heads. Therefore we will create a set-filter corresponding to the rule-filter,
which is also a conjunctive expression, but now over the presence or absence of
an item i in a frequent set, denoted by Set(i) and ¬Set(i). We do this as follows:

1. For each positive literal Body(i) or Head(i) in the rule-filter, add the literal
Set(i) in the set-filter.

2. If for an item i both ¬Body(i) and ¬Head(i) are in the rule-filter, add the
negated literal ¬Set(i) to the set-filter.

3. Add the minimal support threshold to the set-filter.

4. All other literals in the rule-filter are ignored because they do not restrict
the frequent sets that must be generated.
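
The four steps above can be sketched as a small Python function. The representation of a conjunctive rule-filter as four sets of items, and the function name, are assumptions for the example; the support threshold of step 3 is carried over unchanged and therefore left out.

```python
def set_filter(body_pos, body_neg, head_pos, head_neg):
    """Derive the set-filter (Pos, Neg) from a conjunctive rule-filter.

    body_pos / head_pos: items required in the body / head.
    body_neg / head_neg: items forbidden in the body / head.
    """
    # Step 1: every positive Body(i) or Head(i) literal yields Set(i).
    pos = set(body_pos) | set(head_pos)
    # Step 2: an item is excluded from the set only if it is forbidden
    # in both the body and the head.
    neg = set(body_neg) & set(head_neg)
    # Step 4: all other negative literals do not restrict the sets.
    return pos, neg


pos, neg = set_filter(body_pos={"a"}, body_neg={"c", "d"},
                      head_pos={"b"}, head_neg={"d", "e"})
```

Item c is forbidden only in the body and e only in the head, so both may still occur in a frequent set; only d is excluded outright.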

Formally, the following is readily verified: 

Lemma 1. An itemset Z satisfies the set-filter if and only if there exist itemsets
X and Y such that X ∪ Y = Z and the rule X ⇒ Y satisfies the rule-filter, apart
from the confidence threshold.

So, once we have generated all sets Z satisfying the set-filter, we can generate 
all rules satisfying the rule-filter by splitting all these Z in all possible ways in a 
body X and a head Y such that the rule-filter is satisfied. Lemma 1 guarantees 
that this method is “sound and complete”.

We thus need to explain two things: 

1. Finding all frequent Z satisfying the set-filter. 

2. Finding, for each $Z$, the frequencies of all bodies and heads $X$ and $Y$ such
that $X \cup Y = Z$ and $X \Rightarrow Y$ satisfies the rule-filter.



Finding the frequent sets satisfying the set-filter. Let Pos := $\{i \mid$ Set($i$)
in set-filter$\}$ and Neg := $\{i \mid \neg$Set($i$) in set-filter$\}$. Note that
Pos = $\{b_1, \ldots, b_\ell, h_1, \ldots, h_m\}$. Denote the dataset of transactions
by $\mathcal{D}$. We define the following derived dataset $\mathcal{D}_0$:

$$\mathcal{D}_0 := \{\, t - (\mathit{Pos} \cup \mathit{Neg}) \mid t \in \mathcal{D} \text{ and } \mathit{Pos} \subseteq t \,\}$$

In other words, we ignore all transactions that are not supersets of Pos and from 
all transactions that are not ignored, we remove all items in Pos plus all items 
that are in Neg. 

We observe: (proof omitted) 

Lemma 2. Let $p$ be the absolute support threshold defined in the filter. Let
$\mathcal{S}_0$ be the set of itemsets over the new dataset $\mathcal{D}_0$, without any
further conditions, except that their support is at least $p$. Let $\mathcal{S}$ be the
set of itemsets over the original dataset $\mathcal{D}$ that satisfy the set-filter, and
whose support is also at least $p$. Then

$$\mathcal{S} = \{\, s \cup \mathit{Pos} \mid s \in \mathcal{S}_0 \,\}.$$

We can thus perform any frequent set generation algorithm, using only $\mathcal{D}_0$
instead of $\mathcal{D}$. Note that the size of $\mathcal{D}_0$ is exactly the support of
Pos in $\mathcal{D}$. Put differently still: we are mining in a world where itemsets that
do not satisfy the filter simply do not exist. The correctness and optimality of our
method are thus automatically guaranteed.

Note however that now an itemset $I$ actually represents the itemset $I \cup$ Pos!
We thus head-start with a lead of $k$, where $k$ is the cardinality of Pos, in
comparison with standard, non-filtered mining.
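
The construction of $\mathcal{D}_0$ and the lifting step of Lemma 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the frequent set miner is a naive levelwise enumeration standing in for "any frequent set generation algorithm", and the three transactions are the ones used in the example later in this section.

```python
from itertools import combinations

def derive_dataset(transactions, pos, neg):
    """Build D0: keep only supersets of Pos, then strip Pos and Neg items."""
    pos, neg = frozenset(pos), frozenset(neg)
    return [frozenset(t) - pos - neg for t in transactions if pos <= frozenset(t)]

def frequent_sets(transactions, min_support):
    """Naive levelwise miner (a stand-in for any frequent set algorithm)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(len(items) + 1):
        found = False
        for cand in combinations(items, size):
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= min_support:
                result[frozenset(cand)] = support
                found = True
        if not found:  # support is anti-monotone, so larger sets cannot qualify
            break
    return result

def lift(s0, pos):
    """Lemma 2: every s in S0 represents the itemset s ∪ Pos over D."""
    return {s | frozenset(pos): support for s, support in s0.items()}

transactions = [{2, 3, 5, 6, 9}, {1, 2, 3, 5, 6}, {1, 3, 4, 8}]
d0 = derive_dataset(transactions, pos={1, 3}, neg={5})
s0 = frequent_sets(d0, min_support=1)
s = lift(s0, pos={1, 3})
```

Here `d0` contains two transactions, which is the support of Pos = {1, 3} in the original data, and the empty set in `s0` stands for the itemset {1, 3} itself.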



Finding the frequencies of bodies and heads. We now have all frequent
sets containing every $b_i$ and $h_i$, from which rules that satisfy the rule-filter can
be generated. Recall that in phase 2 of Apriori, rules are generated by taking
every item in a frequent set as a head and the others as body. All heads that
result in a confident rule, with respect to the minimal confidence threshold, can
then be combined to generate more general rules. But, because we now only want
rules that satisfy the filter, a head must always be a superset of $\{h_1, \ldots, h_m\}$
and must not include any of the $h'_i$ and $b_i$ (the latter because bodies and heads
of rules are disjoint). In this way, we again head-start with a lead of $m$. Similarly,
a body must always be a superset of $\{b_1, \ldots, b_\ell\}$ and may not include any of the
$b'_i$ and $h_i$.

The following lemma tells us that these potential heads and bodies are already
present, albeit implicitly, in $\mathcal{S}_0$:

Lemma 3. Let $\mathcal{S}_0$ be as in Lemma 2. Let $\mathcal{B}$ ($\mathcal{H}$) be the set of
bodies (heads) of those association rules over $\mathcal{D}$ that satisfy the rule-filter. Then

$$\mathcal{B} = \{\, s \cup \{b_1, \ldots, b_\ell\} \mid s \in \mathcal{S}_0 \text{ and } s \cap \{b'_1, \ldots, b'_{\ell'}, h_1, \ldots, h_m\} = \emptyset \,\}$$

and

$$\mathcal{H} = \{\, s \cup \{h_1, \ldots, h_m\} \mid s \in \mathcal{S}_0 \text{ and } s \cap \{h'_1, \ldots, h'_{m'}, b_1, \ldots, b_\ell\} = \emptyset \,\}.$$

So, for the potential bodies (heads), we use, in $\mathcal{S}_0$, all sets that do not include
any of the $b'_i$ and $h_i$ ($h'_i$ and $b_i$), and add all $b_i$ ($h_i$). Hence, all we have to do is
to determine the frequencies of these subsets in one additional pass. (We do not
yet have these frequencies because these sets do not contain either items $b_i$ or
$h_i$, while we ignored transactions that do not contain all items $b_i$ and $h_i$.)

Each generated itemset can thus have up to three different “personalities:” 

1. A frequent set that satisfies the set-filter; 

2. A frequent set that can act as body of a rule that satisfies the rule-filter; 

3. Analogously for a head. 

We finally generate the desired association rules from the appropriate sets, 
on condition that they have enough confidence. 

Example. We illustrate our method with an example. Assume we are given the
rule-filter

$$\mathrm{Body}(1) \wedge \neg\mathrm{Body}(2) \wedge \mathrm{Head}(3) \wedge \neg\mathrm{Head}(4) \wedge \neg\mathrm{Body}(5) \wedge \neg\mathrm{Head}(5) \wedge \mathit{support} \geq 1 \wedge \mathit{confidence} \geq 50\%.$$

We begin by converting it to the set-filter

$$\mathrm{Set}(1) \wedge \mathrm{Set}(3) \wedge \neg\mathrm{Set}(5) \wedge \mathit{support} \geq 1.$$

Hence Pos = {1,3} and Neg = {5}. Consider a database consisting of the
three transactions {2, 3, 5, 6, 9}, {1, 2, 3, 5, 6} and {1, 3, 4, 8}. We ignore the first
transaction because it is not a superset of Pos. We remove items 1 and 3 from
the second transaction because they are in Pos, and we also remove 5 because
it is in Neg. We only remove items 1 and 3 from the third transaction. After
reading, according to Lemmas 1 and 2, the two resulting transactions, one of
the itemsets we find in $\mathcal{S}_0$ is {4, 8}, which actually represents the set {1, 3, 4, 8}.
It also represents a potential body, namely {1, 4, 8}, but it does not represent a
head, because it includes item 4, which must not be in the head according to
the given rule-filter. As another example, the empty set now represents the set
{1, 3} from which a rule can be generated. It also represents a potential body
and a potential head.
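
The conversion from rule-filter to set-filter and the body/head test of Lemma 3 can be checked on this example with a short sketch. The encoding of literals as `(kind, item, positive)` triples and both function names are assumptions made for illustration; thresholds are left out because they carry over unchanged.

```python
def to_set_filter(literals):
    """Rules 1 and 2 of Section 2.1: derive (Pos, Neg) from a conjunctive rule-filter.

    literals: iterable of (kind, item, positive), kind being "Body" or "Head".
    """
    literals = list(literals)
    pos = {item for _, item, positive in literals if positive}
    not_body = {item for kind, item, positive in literals
                if kind == "Body" and not positive}
    not_head = {item for kind, item, positive in literals
                if kind == "Head" and not positive}
    # An item is excluded from the sets only if it is forbidden on both sides.
    return pos, not_body & not_head

def personalities(s, literals):
    """Lemma 3: can s (an itemset found in S0) stand for a body and/or a head?"""
    literals = list(literals)
    req = {k: {i for kind, i, p in literals if kind == k and p}
           for k in ("Body", "Head")}
    forbid = {k: {i for kind, i, p in literals if kind == k and not p}
              for k in ("Body", "Head")}
    s = set(s)
    can_be_body = not (s & (forbid["Body"] | req["Head"]))
    can_be_head = not (s & (forbid["Head"] | req["Body"]))
    return can_be_body, can_be_head

rule_filter = [("Body", 1, True), ("Body", 2, False), ("Head", 3, True),
               ("Head", 4, False), ("Body", 5, False), ("Head", 5, False)]
pos, neg = to_set_filter(rule_filter)
body_48, head_48 = personalities({4, 8}, rule_filter)
body_empty, head_empty = personalities(set(), rule_filter)
```

For the itemset {4, 8} the sketch reports a potential body but not a potential head, and for the empty set it reports both, matching the discussion above.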



2.2 Arbitrary Boolean Filters 

Assume now that we are given a rule-filter that is an arbitrary Boolean combination
of atomic conditions. We can put it in Disjunctive Normal Form¹ and then generate
all frequent itemsets for every disjunct (which is a conjunction) in parallel
by feeding every transaction of the database to every disjunct, and processing
them there as described in the previous subsection.

However, this approach is a bit simplistic, as it might generate some sets
and rules multiple times. For example, consider the following filter: Body(1) ∨
Body(2). If we convert it to its corresponding set-filter (disjunct by disjunct),
we get Set(1) ∨ Set(2). Then, we would generate for both disjuncts all supersets
of {1, 2}. We can avoid this problem by putting the set-filter in disjoint DNF.²
Then, no itemset can satisfy more than one set-disjunct. On the other hand, this
does not solve the problem of generating some rules multiple times. Consider the
equivalent disjoint DNF of the above set-filter: Set(1) ∨ (Set(2) ∧ ¬Set(1)). The
first disjunct thus contains the set {1, 2} and all of its supersets. If we generate
for every itemset all potential bodies and heads according to every rule-disjunct,
both rule-disjuncts will still generate all rules with the itemset {1, 2} in the body.
The easiest way to avoid this problem is to put the rule-filter itself in disjoint
DNF.

¹ Any Boolean expression has an equivalent DNF.

² In disjoint DNF, the conjunction of any two disjuncts is unsatisfiable. Any Boolean
expression has an equivalent disjoint DNF.

Until now, we have disregarded the possible presence of negated thresholds
in the filters, which can come from the conversion to disjoint DNF, or from the
user himself. Due to space limitations, we defer their treatment to the full paper.

3 Interactive Mining 

3.1 Integrated Filtering or Post-Processing? 

In the previous section, we have seen a way to integrate filter conditions tightly 
into the mining of association rules. We call this integrated filtering. At the 
other end of the spectrum we have post-processing, where we perform standard, 
non-filtered mining, save the resulting itemsets and rules, and then query those 
results for the filter.

Integrated filtering has obvious advantages over post-processing:

However, as already mentioned in the Introduction, data mining query lan- 
guage environments must support an interactive, iterative mining process, where 
a user repeatedly issues new queries based on what he found in the answers of 
his previous queries. Now consider a situation where minimal support require- 
ments and data set particulars are favorable enough so that post-processing is 
not infeasible to begin with. Then the global, non-filtered mining operation, on 
the result of which the filtering will be performed by post-processing, can be 
executed once and its result materialized for the remainder of the data mining
session (or part of it). 

In that case, if the session consists of, say, 20 data mining queries, these 20 
queries amount to standard retrieval queries on the materialized mining results. 
In contrast, answering every single one of the 20 queries by an integrated filter will
involve at least 20, and often many more, passes over the data, as each query 
involves a separate mining operation. We can analyze the situation easily as 
follows. 

Consider a session in which the user issues a total of m data mining queries 
over a database of size n. Suppose that the total number of association rules 
(given a minimal support and confidence requirement) over these data equals
$r$. Let $t$ be the time required to generate all these rules. Moreover, it is not
unreasonable to estimate that in post-processing, each filter executes in time
proportional to $r$, and that in integrated filtering, each filter executes in time
proportional to $n$. Then the total time spent by the post-processing approach is
$t + m \cdot r$, while in the integrated filtering approach this is $m \cdot n$. Hence, if $n > r$,
we have proved the following:

Proposition 1. The integrated filtering total time is guaranteed to grow beyond
the post-processing total time after exactly $m = \lceil t/(n - r) \rceil$ queries.
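
The break-even point follows from $m \cdot n > t + m \cdot r$, i.e. $m > t/(n - r)$. A small sketch makes the arithmetic concrete; the function names and the numbers are hypothetical and not taken from the paper's experiments, and all proportionality constants are assumed to be 1. The function returns the smallest $m$ for which integrated filtering is strictly more expensive, which coincides with $\lceil t/(n - r) \rceil$ except when $t/(n - r)$ is an integer (there the two totals are exactly equal at that $m$).

```python
def total_times(m, t, n, r):
    """Return (post-processing total, integrated filtering total) after m queries."""
    return t + m * r, m * n

def break_even_queries(t, n, r):
    """Smallest m with m * n > t + m * r; only meaningful when n > r."""
    if n <= r:
        raise ValueError("integrated filtering never falls behind unless n > r")
    return int(t // (n - r)) + 1

# Hypothetical cost units: mining everything costs 1000, a pass over the data
# costs 100, and a scan of the materialized rules costs 40.
m_star = break_even_queries(t=1000, n=100, r=40)
post_at, integrated_at = total_times(m_star, 1000, 100, 40)
post_before, integrated_before = total_times(m_star - 1, 1000, 100, 40)
```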

3.2 Online Filtering: Basic Approach 

From the above discussion it is clear that we should try to combine the advantages
of integrated filtering and post-processing. We now introduce such an
approach, which we call online filtering.

In the online approach, all rules and itemsets that result from every posed
query, as well as all intermediate generated itemsets, are saved incrementally.
Initially, when the user issues his first query, nothing has been mined yet, and
thus we answer it using integrated filtering.

Every subsequent query is first converted to its corresponding rule- and set- 
filter in disjoint DNF. For every disjunct in the set-filter, the system checks all 
currently saved itemsets. If an itemset satisfies the disjunct, it is added to the 
data structure holding itemsets, that is used for mining that disjunct, as well 
as all of its subsets that satisfy the disjunct (note that these subsets may not 
all be saved; if they are not, we have to count their supports during the first 
scan through the dataset). We also immediately add all candidate sets, and if 
they were already saved, we add their support, so that they need not be
regenerated and recounted.

If no new candidate sets can be generated, this means that all necessary sets 
were already saved, and we are done. However, if this is not the case, we can now 
begin our filtered mining algorithm with the important generalization that in 
each iteration, candidate itemsets of different cardinalities are now generated. In 
order for this to work, candidate itemsets that turn out to be non-frequent must 
be kept so that they are not regenerated in later iterations. This generalization 
was first used by Toivonen in his sampling algorithm [13]. 

Saving all generated itemsets and rules gives us another advantage that can 
be exploited by the integrated filtering algorithm. Consider a set-filter stating 
that items 1 and 2 must be in the set. In the first pass of the algorithm all single
itemsets are generated as candidate sets over the new dataset $\mathcal{D}_0$ (cf. Section 2.1).
We explained that these single itemsets actually represent supersets of {1, 2}.
Normally, before we generate a candidate set, we check if all of its subsets are
frequent. Of course, this is impossible if these subsets do not even exist in $\mathcal{D}_0$.
Now, however, we can check in the saved results for a subset with too low support;
if we find this, we can avoid generating the candidate.

For rule generation, the same techniques apply. We thus obtain an algorithm 
which reuses previously generated itemsets and rules as if they had been gener- 
ated in previous iterations of the algorithm. We are optimal in the sense that we 
never generate and test sets or rules that were generated before. 

We also note that techniques for dealing with main memory overflow in the
setting of standard, non-filtered, non-interactive association rule mining [14]
remain valid in our approach.

3.3 Online Filtering: Improvements 

In the worst case, the saved results do not contain anything that can be reused 
for answering a filter, and hence the time needed to generate the rules that 
satisfy the filter is equal to the time needed when answering that filter using the 
integrated filtering approach. In the best case, all requested rules are already 
saved, and hence the time needed to find all rules that satisfy the filter is equal 
to the time needed for answering that filter using post-processing. In the average 
case, part of the needed sets and rules are saved and will then be used to speed 
up the integrated filtering approach. If the time gained by this speedup is more
than the time needed to find the reusable sets and rules, then the online approach 
will always be faster than the integrated filtering approach. In the limit, all rules 
will be materialized, and hence all subsequent filters will be answered using 
post-processing. 

Could it be that the time gained by the speedup in the integrated filtering is 
less than the time needed to find the reusable sets and rules? This could happen 
when a lot of sets and rules are already saved, but almost none of them satisfies 
the filter. We can however counter this phenomenon by improving the speedup. 
The improvement is based on estimating what is currently saved, as follows. 

We keep track of a set-filter $\varphi_{sets}$ which describes the saved sets, and of a
rule-filter $\psi_{rules}$ which describes the saved rules. Both filters are initially false.
Given a new query (rule-filter) $\psi$, the system now goes through the following
steps (step 3 was described in Section 2.1):

1. $\psi_{mine} := \psi \wedge \neg\psi_{rules}$

2. $\psi_{rules} := \psi_{rules} \vee \psi$

3. Convert the rule-filter $\psi_{mine}$ to the set-filter $\varphi$

4. $\varphi_{mine} := \varphi \wedge \neg\varphi_{sets}$

5. $\varphi_{sets} := \varphi_{sets} \vee \varphi$

After this, we perform: 

1. Generate all frequent sets according to $\varphi_{mine}$, using the basic online
approach.

2. Retrieve all saved sets satisfying $\varphi \wedge \neg\varphi_{mine}$.

3. Add all needed subsets that can serve as bodies or heads.

4. Use $\psi_{mine}$ to generate rules from the sets of steps 1 and 2.

5. Retrieve all saved rules satisfying $\psi$.

Note that the filter $\varphi_{mine}$ is much more specific than the original filter $\varphi$.
We thus obtain the improvement in speedup from integrated filtering, which we
already pointed out to be proportional to the strength of the filter.
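
The bookkeeping in steps 1 to 5 can be sketched with filters represented as Python predicates. This is only an illustration under strong assumptions: the class and function names are hypothetical, thresholds are ignored, and step 3 is done semantically by brute force over all body/head splits (following Lemma 1) rather than by the syntactic DNF conversion the paper describes.

```python
from itertools import combinations

def splits(itemset):
    """Yield every (body, head) split of an itemset into two disjoint parts."""
    items = frozenset(itemset)
    for size in range(len(items) + 1):
        for body in combinations(sorted(items), size):
            yield frozenset(body), items - frozenset(body)

def to_set_filter(rule_filter):
    """Brute-force stand-in for step 3: Z qualifies if some split satisfies the rule-filter."""
    return lambda itemset: any(rule_filter(b, h) for b, h in splits(itemset))

class Session:
    def __init__(self):
        self.psi_rules = lambda body, head: False  # describes the saved rules
        self.phi_sets = lambda itemset: False      # describes the saved sets

    def plan(self, psi):
        saved_rules, saved_sets = self.psi_rules, self.phi_sets
        psi_mine = lambda b, h: psi(b, h) and not saved_rules(b, h)    # step 1
        self.psi_rules = lambda b, h: saved_rules(b, h) or psi(b, h)   # step 2
        phi = to_set_filter(psi_mine)                                  # step 3
        phi_mine = lambda s: phi(s) and not saved_sets(s)              # step 4
        self.phi_sets = lambda s: saved_sets(s) or phi(s)              # step 5
        return psi_mine, phi, phi_mine

session = Session()
session.plan(lambda b, h: 1 in b)                      # first query: Body(1)
psi_mine, phi, phi_mine = session.plan(
    lambda b, h: 1 in b or 2 in b)                     # second query: Body(1) or Body(2)
```

After the second query, sets containing item 1 are recognized as already saved, so only sets with item 2 and without item 1 remain to be mined.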

3.4 Avoiding Exploding Filters 

The improvement just described incurs a new problem. The formula $\psi_{rules}$ (or
$\varphi_{sets}$) becomes longer with the session. When, given the next filter $\psi$, we mine
for $\psi \wedge \neg\psi_{rules}$, we will convert to disjoint DNF, which could explode.

To avoid this, consider $\psi_{rules}$ in DNF: $\psi_1 \vee \cdots \vee \psi_n$. Instead of the full filter
$\psi \wedge \neg\psi_{rules}$, we are going to use a filter $\psi \wedge \neg\psi'_{rules}$, where $\psi'_{rules}$ is obtained
from $\psi_{rules}$ by keeping only the least restrictive disjuncts $\psi_i$ (their negation will
thus be most restrictive). In this way $\psi \wedge \neg\psi'_{rules}$ is short.

But how do we measure restrictiveness of a $\psi_i$? Several heuristics come to
mind. A simple one is to keep for each $\psi_i$ the number of saved sets that satisfy
it. These numbers can be maintained incrementally.
4 Experimental Comparison

For our experiments, we have implemented an extensively optimized version of
the Apriori algorithm, equipped with the filtering optimizations as described in
the previous sections.

We experimented with a session of 40 queries using the integrated filtering
approach, the post-processing approach and the online approach. The transaction
database was synthetically generated using the program provided by the
Quest research group at IBM Almaden and contained 1 000 000 transactions over
10 000 items. The performance figures are shown in Figure 1.

The first 20 queries all require different items in the rules such that the online
approach is just a little faster than the integrated filtering approach, because it
cannot reuse that many pre-generated sets and rules. From there on the online
approach has already collected a lot of sets and rules that can be reused. After 
the 20th query we can see some improvement on this until the 30th query, where 
the online approach has collected all sets and rules needed to answer further 
queries and hence the time needed to answer these queries is equal to the time 
needed to answer these queries using the post-processing approach. 




Fig. 1. Experiments. Time is in seconds. 
1 Introduction

In many daily transactions, the time when an event takes place is known and 
stored in databases. Examples range from sales records, stock exchange, patient 
records, to scientific databases in geophysics and astronomy. Such databases 
incorporate the concept of time which describes when an event starts and ends 
as historical records [9]. The temporal nature of data provides us with a better 
understanding of the trend or pattern over time. In market-basket data, we 
can have a pattern like “75% of customers buy peanuts when butter starts to
be in big sales and before bread is sold out”. We observe that there may be
some correlations among peanuts, butter and bread so that we can have better 
planning for marketing strategy. Knowledge discovery in temporal databases 
thus catches the attention of researchers [8,4]. 

Previous work for knowledge discovery in temporal data includes [2,6,5]. 
These techniques only treat data as series in chronological order and mainly 
support point-based events. Therefore, the physical ordering of events would be 
quite simple and the expressive power in specifying temporal relations such as 
during, overlaps, etc. is limited. When we deal with temporal data for events 
which last over a period of time, besides parallel and serial ordering of event se- 
quences, we may find more interesting temporal patterns. For instance, patterns 
like “event A occurs during the time event B happens” or “event A’s occurrence 
time overlaps with that of event B and both of these events happen before event 
C appears” cannot be expressed as simple sequential orders. On the other hand, 
temporal logic [7] is suggested to be used for expressing temporal patterns de- 
fined over categorical data. Temporal operators such as since, until and next are 
used. We may have patterns like “event A always occurs until event B appears”. 
Simple ordering of events is considered. 

In this paper, we consider interval-based events where the duration of events 
is expressed in terms of endpoint values, and these are used to form temporal 
constraint in the discovery process. We introduce the notion of temporal repre- 
sentation which is capable of expressing the relationships between interval-based 
events. We develop new methods for finding such interesting patterns. 

Y. Kambayashi, M. Mohania, and A M. Tjoa (Eds.): DaWaK 2000, LNCS 1874, pp. 317-326, 2000. 

© Springer-Verlag Berlin Heidelberg 2000 
2 Problem Statement

Suppose we are given a temporal database $\mathcal{D}$, which stores a list of clinical
records. Each record contains a person-id, name of disease and a pair of ordered
time points $t_s$ and $t_e$ where $t_s < t_e$ and both are integers. They are the start
time and end time, indicating the period during which the patient contracted
a certain kind of disease.

A record may have other attributes; for simplicity, we consider here a single
temporal attribute and denote it as an event. We assume a set $\mathcal{E}$ of event types.
An event $E$ has an associated time of occurrence and it is specified by a triple
$(A, t_s, t_e)$, where $A \in \mathcal{E}$ is an event type and $t_s$ and $t_e$ are the start time and
the end time, respectively. We also use $E.t_s$ and $E.t_e$ to indicate these times.

We assume that each event is associated with one person. A sequence of
events is defined as a list of events where each event is associated with the
same person such that for person $j$, we have the following sequence $S_j$, where
$S_j = ((A_1, t_{s_1}, t_{e_1}), (A_2, t_{s_2}, t_{e_2}), \ldots, (A_n, t_{s_n}, t_{e_n}))$. The events are ordered by the
end times where $t_{e_i} \leq t_{e_{i+1}}$ for all $i = 1, \ldots, n - 1$.

Fig. 1. The thirteen possible relationships between two intervals X and Y



Relationships among time intervals can be described in different ways. Allen’s
taxonomy of temporal relationships [3] is adopted to describe the basic
relationships among events and is summarized in Figure 1. The relations between
intervals can be expressed in terms of relations between their endpoints; we call
these the endpoint constraints. For instance, consider the sequence
$(E_1 = (A, 5, 10), E_2 = (B, 8, 12))$; we have “A overlaps B” since $E_1.t_s < E_2.t_s$,
$E_1.t_e > E_2.t_s$ and $E_1.t_e < E_2.t_e$.
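
The endpoint constraints translate directly into code. The following sketch classifies two intervals into one of Allen's thirteen relations; the function name and the relation labels are chosen here for illustration, and intervals are assumed to be `(start, end)` pairs with `start < end`.

```python
def allen_relation(x, y):
    """Return the Allen relation r such that 'x r y' holds, from endpoint comparisons."""
    (xs, xe), (ys, ye) = x, y
    if xe < ys:
        return "before"
    if ye < xs:
        return "after"
    if xe == ys:
        return "meets"
    if ye == xs:
        return "met-by"
    if xs == ys and xe == ye:
        return "equal"
    if xs == ys:
        return "starts" if xe < ye else "started-by"
    if xe == ye:
        return "finishes" if xs > ys else "finished-by"
    if xs < ys and xe > ye:
        return "contains"
    if xs > ys and xe < ye:
        return "during"
    # Remaining case: the intervals properly overlap on one side.
    return "overlaps" if xs < ys else "overlapped-by"

a_vs_b = allen_relation((5, 10), (8, 12))   # E1 = (A,5,10), E2 = (B,8,12) from the text
b_vs_a = allen_relation((8, 12), (5, 10))
```

Swapping the arguments yields the mirror-image relation, which is why only one relation of each mirror pair needs to be kept once the order of events is fixed.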

These thirteen relationships can be used to express any relationship held 
between two intervals and they provide a basis for the description of temporal 
patterns. Some of the relations are mirror images of others; for example, “X
overlaps Y” implies the same relation as “Y is overlapped-by X”. We only focus
on seven primitive temporal relations with the order of items preserved. The
seven relations are shown in the shaded area in Figure 1. Let us call this set
of seven temporal relations Rel. We obtain binary temporal predicates if we
consider two events only. To express complex relations among more events in a
sequence, we have the following definitions:

Definition 1. A temporal pattern is defined recursively as follows:

— If $E$ is a single event type in $\mathcal{E}$, then $E$ is a temporal pattern; it is also called
an atomic pattern.

— If $X$ and $Y$ are two temporal patterns, and $rel \in Rel$, then $(X\ rel\ Y)$ is
also a temporal pattern. This is a composite pattern.

The size of a temporal pattern is the number of atomic patterns in the pattern.

Definition 2. An atomic temporal pattern $X$ has a mapping in a sequence $S$
if we can find an event $E$ of type $X$ in the sequence. We denote this mapping
by $\mathcal{M}(X, S) = \{E\}$. We associate a time duration to the mapping as follows:
$\mathcal{M}(X, S).t_s = E.t_s$, $\mathcal{M}(X, S).t_e = E.t_e$. We say that $X$ holds in $S$.

A composite pattern $(X\ rel\ Y)$ in which $Y$ is an atomic pattern and $rel \in Rel$
has a mapping $\mathcal{M}((X\ rel\ Y), S)$ in a sequence $S$ if $X$ has a mapping
$\mathcal{M}(X, S)$ in $S$ and we can find an event $E \notin \mathcal{M}(X, S)$ of type $Y$ in $S$ to be
mapped as $\mathcal{M}(Y, S)$ such that if we consider some imaginary event $Z$ with start
time of $\mathcal{M}(X, S).t_s$ and end time of $\mathcal{M}(X, S).t_e$, then $Z\ rel\ E$ is true.

In this case, $\mathcal{M}((X\ rel\ Y), S) = \mathcal{M}(X, S) \cup \{E\}$. We say that the relation
$(X\ rel\ Y)$ holds in $S$. The mapping $\mathcal{M}((X\ rel\ Y), S)$ has an associated time
interval given by

$$\mathcal{M}((X\ rel\ Y), S).t_s = \min\{\mathcal{M}(X, S).t_s,\ \mathcal{M}(Y, S).t_s\}$$

$$\mathcal{M}((X\ rel\ Y), S).t_e = \mathcal{M}(Y, S).t_e$$

In the above mapping of a composite pattern in a sequence, the union of two time
intervals takes place. For example, for the sequence shown in Figure 2(a), we
obtain a composite pattern “(A overlaps B) before C” with the resultant interval
being [5,18].




Fig. 2. Different temporal pattern representation: (a) ((A overlaps B) before C) overlaps D; (b) (A overlaps B) before (C during D)

We can see that in Figure 2, two different composite patterns “((A overlaps
B) before C) overlaps D” and “(A overlaps B) before (C during D)” hold in
the sequence by considering different ways of combinations. The number of such
possible descriptions is exponential in the sequence size. We restrict our interest
to temporal patterns of the form $((\ldots(A_1\ rel_1\ A_2)\ rel_2\ A_3) \ldots\ rel_{k-1}\ A_k)$. We call
these the A1 temporal patterns.

There are two main reasons for this restriction: (1) We believe that the tem- 
poral relations give some insight into causal relationships. As such, when a few 
events have happened, together they may become the cause of a following event. 
(2) Discovering all possible patterns can be computationally prohibitive, and
the amount of results presented to the user can be overwhelming. We have implemented
the mechanism to discover temporal patterns obtained by appending a composite
pattern of size two at a time as shown in Figure 2(b). Experimental results
show that even a small extension in this way leads to a large increase in
computational cost.

Moreover, the user can specify the maximum length of time interval that is of
interest, known as the window-size, win_size. The intuition is that if some
events do not happen close enough to each other, we would not be interested
to find any correlation between them. If $w$ is a given window-size, a temporal
pattern $P$ holds within $w$ in a sequence $S$ if there is a mapping $\mathcal{M}(P, S)$ such
that $\mathcal{M}(P, S).t_e - \mathcal{M}(P, S).t_s < w$.

Definition 3. Let $A_i$, $i = 1, \ldots, k$ be a bag of $k$ event types in $\mathcal{E}$, $rel_i \in Rel$,
$i = 1, \ldots, k-1$. A $k$-item has the form $(\{A_1, A_2, \ldots, A_k\}, \{rel_1, rel_2, \ldots, rel_{k-1}\}, P)$,
where $P$ is a temporal pattern in terms of the event types $A_1, A_2, \ldots, A_k$ and the
relations $rel_1, \ldots, rel_{k-1}$, and $k \geq 1$.

Given a window-size $w$, let $M$ be a subset of the set of events in a sequence
$S$. $M$ supports a $k$-item $(\{A_1, A_2, \ldots, A_k\}, \{rel_1, rel_2, \ldots, rel_{k-1}\}, P)$ if $M$ is a
mapping for the temporal pattern $P$ and $M.t_e - M.t_s < w$.

The support of a temporal pattern $P$ in a set of sequences $\mathcal{D}$ is defined as

$$\mathit{support}(P, \mathcal{D}) = \frac{|\{M \subseteq S \mid S \in \mathcal{D},\ M \text{ supports } P\}|}{|\mathcal{D}|}$$

The support of a $k$-item is the support of the temporal pattern in the $k$-item.

A large $k$-item is a $k$-item having support greater than a threshold, namely
min_sup provided by users, that is, $\mathit{support}(P, \mathcal{D}) > \mathit{min\_sup}$. Our aim is to
find the large $k$-items for all $k$.

3 Mining A1 Temporal Patterns

Here we propose a method for mining the frequent A1 temporal patterns as
shown in Figure 2(a). Consider a patient database shown in Table 1. Each tuple
contains a person-id, the disease contracted by the patient and the duration of
the disease. The database can be used to find if some diseases are likely to cause
some other diseases and their temporal relations. It is assumed that the min_sup
is 66% and the win_size is set to be 100 time units. We use a layout of the
event sequence different from that used in finding sequential patterns [2]. Instead
of transforming the original database into a list of sequences, the seq_list, where
each sequence corresponds to one person, we use an item_list to represent the
temporal data. Each event is associated with a list of person-id, start time and
end time (pid, ts, te). Table 1 illustrates the differences between the two approaches.
Note that each item is an atomic or composite pattern and the ts and te in the
item_list indicate a time interval for this pattern. A similar idea of transforming
the database into the form which we call item_list is used in [10].

Original database

person-id  disease  start  end
1          A        5      10
1          B        8      12
1          D        14     20
1          C        16     18
1          B        17     22
2          A        8      14
2          B        9      15
2          D        16     22
2          C        19     21
3          A        12     20
3          B        22     25
3          D        28     32
3          C        29     31

seq_list

person-id  seq-list
1          (A,5,10), (B,8,12), (D,14,20), (C,16,18), (B,17,22)
2          (A,8,14), (B,9,15), (D,16,22), (C,19,21)
3          (A,12,20), (B,22,25), (D,28,32), (C,29,31)

item_list

item  (pid, ts, te) list
A     (1,5,10), (2,8,14), (3,12,20)
B     (1,8,12), (1,17,22), (2,9,15), (3,22,25)
D     (1,14,20), (2,16,22), (3,28,32)
C     (1,16,18), (2,19,21), (3,29,31)

Table 1. Transform the database as seq_list and item_list

A basic strategy similar to the Apriori-gen approach [1] is used. With seq_list,
we need to scan the database once in each iteration. The item_list approach
avoids this problem since it enables us to count support by direct composition of
the lists. The size of these lists and number of candidates shrink as the sequence
length increases, which facilitates fast computation. This motivates us to choose
the item_list format to store the large $k$-items for efficient support counting.

The general structure of the method for the mining process is mainly divided
into two phases: (1) Candidate Generation and (2) Large k-items Generation.
It takes multiple passes over the data. In each pass, we start with a
seed set of large items which is found in the previous pass. We use the seed set
for generating new potentially large items, called candidates, by adding one
atomic item to elements in the seed set. In the second phase, for each candidate,
we examine the $L_{k-1}$ and $L_1$ item_lists and determine the temporal relations
between the composite pattern and atomic pattern that have sufficient supports.
We then generate new large $k$-items and obtain the $L_k$ item_list by merging
the composite and atomic patterns with the temporal relation. The algorithm
terminates when we cannot find any large $k$-items after the end of the current
pass.

Candidate Generation: 

The logical form of the candidates is shown in an example in Table 2. The 
candidate generation to obtain Ck from Lk-i is done by adding one large 1- 
item each time. We generate the candidates by combining the events in atomic 
patterns with those in composite patterns. 



Partial large 2-items

item          count  mappings (pid, ts, te)
A overlaps B  2      (1,5,12), (2,8,15)
A before B    2      (1,5,22), (3,12,25)
C during D    3      (1,14,20), (2,16,22), (3,28,32)

The 3-candidates

item 1        item 2
A overlaps B  C
A overlaps B  D
A before B    C
A before B    D

Table 2. Partial large 2-items and the 3-candidates



In each pass, we generate $C_k$ from $L_{k-1}$. We prune out some irrelevant candidates
as follows: let $b_i$ be the $(k-1)$-item with pattern
$((\ldots(a_{i_1}\ rel_{r_1}\ a_{i_2}) \ldots)\ rel_{r_{k-2}}\ a_{i_{k-1}})$, where $rel_{r_i} \in Rel$. If
$(a_{i_1}\ rel_{ij}\ a_j) \in L_2$ or $(a_{i_{k-1}}\ rel_{ij}\ a_j) \in L_2$ for any $rel_{ij}$, then
take the candidate element $\{b_i, a_j\}$. For instance, for the 2-item with a pattern
of “C during D”, we aim to find any temporal relation between the 2-item and
event A. Since, in the previous pass, we have found that no pattern of the form “C
rel A” or “D rel A” is large, we exclude the possibility of having the
candidate “(C during D)” with “A” in the generation of large 3-items.

Large k-Items Generation:

This phase is further divided into two subphases. They are the support counting
phase and the generation of large $k$-items. First, we need to find the supports
for the candidates generated. We determine the number of sequences that support
each temporal relation between the composite patterns in each candidate.
We compare the endpoint values of elements in $L_{k-1}$ and $L_1$ and determine if
any temporal relation holds between the composite and atomic patterns. Large
$k$-items are formed if their support is greater than the threshold given.

To facilitate efficient counting, we use a hash tree to store $L_1$ and also the
relevant part of the 1-item_list, and a hash tree to store $L_{k-1}$ and part of the
$(k-1)$-item_list. We use the value of the event as a key for hashing in the hash
tree for $L_1$. For the hash tree for composite patterns, we use the values of all
the events included and the temporal relations together to form a key by simple
concatenation. Each leaf node of the hash tree corresponds to some large $k$-item
$I$ and also records the mappings for the pattern in $I$. The mappings are stored
in the form of an item_list, with the person-ids, start times and end times. For each
candidate, we use the composite pattern and event as search keys and find from
the hash trees the corresponding $L_{k-1}$ and $L_1$ items.
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During the search in the hash tree, hashing is done in the internal nodes until 
we reach the leaf nodes where we perform exact matching for the composite or 
atomic pattern. We consider a pattern P for a large (k — l)-item and a pattern 
P' for a large 1-item. We can identify any temporal relation that holds between 
a mapping for P and a mapping in P’ , since the start times and end times are 
recorded in the corresponding hash trees. If some composite pattern is found, the 
count for the candidate with respect to the specihc temporal pattern is increased 
by 1. The counts are kept as auxiliary information to the table for the candidate 
items. There are seven counts for each candidate, one for each of the temporal 
relationships. 
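
As a rough illustration of the support counting step, the Python sketch below classifies the relation between two intervals by comparing endpoints and keeps one counter per relation. The endpoint conditions follow the usual interval-algebra definitions of the seven relations named in this paper; the function names, the dictionaries keyed by person-id, and the restriction to one mapping per person are assumptions, not the authors' hash-tree code.

```python
from collections import Counter

RELATIONS = ("starts", "overlaps", "before", "during",
             "finishes", "meets", "equal")


def relation(x, y):
    """Return the relation name for "x rel y", or None if none of the seven holds.

    x and y are (start, end) pairs with start < end.
    """
    (s1, e1), (s2, e2) = x, y
    if s1 == s2 and e1 == e2:
        return "equal"
    if e1 < s2:
        return "before"
    if e1 == s2:
        return "meets"
    if s1 == s2 and e1 < e2:
        return "starts"
    if e1 == e2 and s1 > s2:
        return "finishes"
    if s1 > s2 and e1 < e2:
        return "during"
    if s1 < s2 < e1 < e2:
        return "overlaps"
    return None  # only an inverse relation holds


def count_relations(composite_map, atomic_map):
    """Count, per relation, the persons whose two mappings stand in it.

    Both arguments map person-id -> (start, end); a person contributes
    only when a mapping exists for both patterns.
    """
    counts = Counter()
    for person_id, x in composite_map.items():
        y = atomic_map.get(person_id)
        if y is not None:
            rel = relation(x, y)
            if rel is not None:
                counts[rel] += 1
    return counts
```

A candidate would then become a large item for a given relation when the corresponding count exceeds the support threshold.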

The second subphase is to form large k-items. Table 2 shows a partial set of
the large 2-items. After identifying any temporal relation between the items in
C_k, we generate L_k from L_{k-1} and L_1 with the corresponding temporal relation.
Each new item in L_k is a composite pattern and is used for the next pass. The
resultant interval is obtained from the union of the two intervals of the (k-1)-item
and the 1-item. For instance, as shown in Table 2, the start times and end times of
the mappings of composite pattern “A overlaps B” are {[5,12], [8,15]}.

4 Mining Other Temporal Patterns 

In the previous section, in mining the A1 temporal pattern, we generate large
k-items by adding one atomic pattern in L_1 at a time. Here we consider a slightly
more complex form of temporal pattern, which we call the A2 temporal pattern.
The A2 temporal pattern is defined recursively as follows: (1) a temporal
pattern of size 2 is an A2 temporal pattern, e.g. “A before B”; (2) if X is an A2
temporal pattern and Y is a temporal pattern of size 2, then (X rel Y), where
rel ∈ Rel, is also an A2 temporal pattern. An example of such a composition is
shown in Figure 2(b). The patterns we generate have an even number of events,
i.e. they are 2k-items. We therefore focus on temporal relations among events by
adding one 2-item each time. We shall refer to the method in Section 3 as the first
method and the method described in this section as the second method. The
second method works similarly to the first method except for the candidate
generation phase and the formation of large 2k-items.

Candidate Generation: 

In the formation of C_2, the process is the same as before. Next, to
generate C_{2k}, where k ≥ 2, we examine L_{2k-2} and L_2 and use compositions of
the elements in the two item_lists of L_{2k-2} and L_2. When we prune any irrelevant
candidates in this phase, we need to consider two newly added atomic patterns
this time, say a_{j1} and a_{j2}, where (a_{j1} rel_j a_{j2}) ∈ L_2. The two added items can
be combined with a composite pattern, say b_i, where b_i ∈ L_{2k-2}, if both of the
following conditions hold:

1. there is a relation between the leftmost atomic pattern of b_i and at least one
   of a_{j1} and a_{j2}.

2. there is a relation between the rightmost atomic pattern of b_i and a_{j1} or a_{j2}.

Large 2k-Items Generation:

We also divide this into two phases, namely support counting and the
generation of large 2k-items. In general, the second method works in the same manner
as the first method in that we generate incrementally larger 2k-items by
combining the two composite patterns of a 2k-candidate. The difference is that we
use the item_lists of L_{2k-2} and L_2.
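
The two pruning conditions used in candidate generation for the second method can be sketched in Python as follows. This is only an illustration: the function name and the encoding of "there is a relation" as membership of an ordered pair in a set are assumptions.

```python
def can_combine(composite_events, new_pair, related):
    """Check both pruning conditions for combining b_i with a large 2-item.

    composite_events: atomic events of b_i, ordered left to right.
    new_pair: the two newly added atomic patterns (a_j1, a_j2).
    related: ordered pairs (x, y) such that some "x rel y" is a large
             2-item (an assumed encoding of L_2).
    """
    leftmost, rightmost = composite_events[0], composite_events[-1]

    def has_relation(event):
        # True if `event` is related to at least one of the new patterns.
        return any((event, added) in related for added in new_pair)

    # Condition 1 concerns the leftmost atomic pattern of b_i,
    # condition 2 the rightmost one; both must hold.
    return has_relation(leftmost) and has_relation(rightmost)
```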

5 Performance Evaluation 

To evaluate the performance of the proposed methods over a large range of 
data, we conducted several experiments on an UltraSparc 5/270 workstation 
with 520MB of main memory. The two methods in Sections 3 and 4 are written 
in C. We consider synthetic data in the application domain of a medical database,
the same as that of the given example. For each person we record a sequence of
clinical records stating the different diseases contracted. 




Fig. 3. Distribution of temporal relations between events 



The synthetic data generation program took four parameters: the number of
sequences (|D|), the average number of events per large item (|T|), the number
of maximal potentially large items (N_S), and the number of event types (N). The
data generation model was based on the one used for mining sequential patterns
[2], with some modification to model the patient database. We generated each
large sequence by first picking a number of events from a Poisson distribution
with mean equal to |T|, and we chose the event types randomly. We picked
a temporal relation between events and formed a pattern. Temporal relations
are chosen from the set Rel according to the distribution shown in Figure 3. We
used the values {1, 2, 3, 4, 5, 6, 7} to represent starts, overlaps, before, during,
finishes, meets and equal, respectively. The distribution was determined arbitrarily
by our intuitive expectation of the likelihood of the relations. Each person was
then assigned a potentially large sequence. The time interval of each event
followed an exponential distribution with mean μ equal to 5. For each sequence,
the time at which the first event took place was chosen randomly within the time
interval [0,500] of time units. The following events of the item then followed
the temporal relation held between events. We generated the dataset by setting
N_S = 2000, N = 1000, |D| = 10K and |T| = 5, with 1MB of data. We studied the
effect of different values of min_sup, win_size, number of sequences and events
per sequence for the two methods.

First, we studied the effect of minimum support on the processing time.
We used 7 values of minimum support (min_sup), as shown in Figure 4(a), and
100 time units for the window size (win_size). The figure shows that the execution
time decreases when the minimum support increases for both methods. As the
support threshold increases, fewer large items are generated and hence the size
of the candidate set in each iteration decreases dramatically. Thus less time is
required for support counting and hash tree searching of large items. Comparing
the two methods, the second method needs much more time than the
first method. This is due to the large amount of computation time spent pruning the
candidates, as the addition of two atomic patterns is considered instead of one.




Fig. 4. Variation on (a) minimum support and (b) window size

Fig. 5. Scale-up: (a) number of sequences and (b) number of events per sequence

We then studied the effect of window size on the processing time. We chose
min_sup values of 0.0008 for the first method and 0.001 for the second
method. In Figure 4(b), we can see that when the window size increases, the
execution time increases for both methods. This is because more sequences are
included and the time for the support counting phase increases. The number of
iterations also increases, which likewise requires a longer execution time.

Our next target is to consider the scale-up effects. Figure 5(a) shows how
both methods scale up as the number of sequences is increased ten-fold, ranging
from 10K to 100K, with min_sup = 0.0025 for both methods. The execution
time for both methods increases with an increasing number of sequences.

Finally, we studied the scale-up as we increase the average number of events
per sequence. The number of sequences used is 10K and is kept constant. We
vary the average number of events per sequence from 2.5 to 25. Figure 5(b)
shows the scalability results of the two methods. We set min_sup = 0.0025 for
both methods. From the figure, the execution time increases rapidly with the
increasing number of events per sequence.

In conclusion, we propose methods for discovering interesting temporal
patterns for interval-based events. A set of experiments has been conducted to
demonstrate the overall performance of the methods. From the experiments, we find
that the computation time required for the first pattern is much more
acceptable.

Abstract. In the real world, the knowledge used for aiding decision-making is
always time varying. Most existing data mining approaches assume that
discovered knowledge is valid indefinitely. Temporal features of the knowledge
are not taken into account in mining models or processes. As a consequence,
people who expect to use the discovered knowledge may not know when it
became valid or whether it is still valid. This limits the usability of discovered
knowledge. In this paper, temporal features are considered as important
components of association rules for better decision-making. The concept of
temporal association rules is formally defined and the problems of mining these
rules are addressed. These include identification of valid time periods and
identification of periodicities of an association rule, and mining of association
rules with a specific temporal feature. A system has been designed and
implemented for supporting the iterative process of mining temporal association
rules, along with an interactive query and mining interface with an SQL-like
mining language.



1. Introduction 

Data mining is becoming an important tool for decision-making. It can be used for
the extraction of non-trivial, previously hidden patterns from databases. Over the last
few years, many forms of such patterns have been identified in various application
domains. A typical one is the association rule [1]. Consider a supermarket database,
where the set of items purchased by a customer is stored as a transaction. An example
of an association rule is: “60% of transactions that contain bread and butter also
contain milk; 30% of all transactions contain all these three items.” 60% is called the
confidence of the rule and 30% the support. The meaning of this is that customers
who purchase bread and butter also buy milk. In all but the most trivial applications,
however, the knowledge is time varying. Knowing when something became valid, or
how often in time a pattern occurs, can provide more valuable information to
individuals and organisations. For example, in retail applications, people may be more
interested in the association “customers who buy bread and butter also buy milk
during the summer” rather than the much simpler one “customers who buy bread and
butter also buy milk”. In this paper, temporal features are considered as important
components of association rules. The concept of temporal association rules is
formally defined and a methodology for mining them is suggested. Temporal
association rules can be extracted by alternately performing different forms of
restricted mining tasks, such as identification of valid time periods and identification
of periodicities of an association rule, and mining of association rules with a specific
temporal feature. To support the interactive and iterative process of mining temporal
association rules, an integrated query and mining system (IQMS) has been designed
and implemented, with an SQL-like mining language TML (Temporal Mining
Language) for describing the mining tasks.

The rest of the paper is organised as follows. Section 2 describes related work. 
Section 3 introduces temporal association rules and states the mining tasks for them. 
Section 4 presents a temporal mining support system. Experimental results on the 
number of discovered rules are finally discussed in section 5, followed by 
conclusions. 



2. Related Work 

The problem of finding association rules was introduced in [1]. Given a set of 
transactions, where each transaction is a set of items, an association rule between two 
sets of items, X and Y, is an expression of the form X=>Y, indicating that the 
presence of X in a transaction will also imply the presence of Y in the same 
transaction. Temporal issues were not considered in the original problem modelling, 
where the discovered association rules were assumed to hold in the universal domain 
of time. However, transactions in most applications are usually stamped with time 
values in the database. Temporal issues of association rules have recently been 
addressed in [3], [7] and [8]. The major concern in [3] is the discovery of association 
rules for which the valid period and the periodicity are known. The valid period, such 
as “starting from September 1995 and ending by August 1998”, shows the absolute 
time interval during which an association is valid, while the periodicity, such as 
“every weekend”, conveys when and how often an association is repeated. Both the 
valid period and the periodicity are specified by calendar time expressions. The 
cyclicity of association rules was similarly discussed in [7], where a cycle of an
association rule was defined as a tuple (l, o) such that the rule holds in every l-th time
unit starting with time unit t_o. In [8] the concept of calendric association rules is
defined, where the association rule is combined with a calendar, such as “all working 
days in 1998”, that is basically a set of time intervals and is described by a calendar 
algebra. 
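
To make the notion of a cycle (l, o) from [7] concrete, here is a small Python sketch. It assumes, purely for illustration, that the time units are numbered from 0 and that a boolean list records whether the rule holds in each unit; neither the function name nor this encoding comes from the cited work.

```python
def has_cycle(holds, length, offset):
    """Return True if the rule holds in every `length`-th unit from `offset`.

    holds[i] is True when the rule holds in time unit t_i (0-based).
    """
    if not 0 <= offset < length:
        raise ValueError("offset must satisfy 0 <= offset < length")
    units = range(offset, len(holds), length)
    # Require at least one unit so that an empty history has no cycle.
    return len(units) > 0 and all(holds[i] for i in units)
```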



3. Discovery of Temporal Association Rules: Task Description 

3.1 Temporal Association Rules 

Let I = {i_1, i_2, ..., i_m} be a set of literals which are called items. A set of items
X ⊂ I is called an itemset. Let D be a set of time-stamped transactions over the time
domain T, which is a totally ordered set of time instants (or chronons). Each
time-stamped transaction S is a triplet <tid, itemset, timestamp>, where S.tid is the
transaction identifier, S.itemset is a set of items such that S.itemset ⊆ I, and
S.timestamp is an instant (or a chronon) with which the transaction S is stamped,
such that S.timestamp ∈ T. A transaction S contains an itemset X if X ⊆ S.itemset.
In the discrete and linear model of time, a time interval is the time between two
instants, basically represented by a set of contiguous instants. In general, any
temporal feature TF, whether it is a valid period, a periodicity, or a specific
calendar, represents a set of time intervals, which is called its interpretation and
is denoted by Φ(TF) = {P_1, P_2, ..., P_n}, where P_i is a time interval. For example,
the interpretation of a periodic time PT, Φ(PT), is a set of periodic intervals in
cycles. The interpretation of a valid time period VI, Φ(VI), consists of a single
contiguous time interval. The interpretation, Φ(CAL), of a specific calendar CAL
is the set of all relevant intervals in this specific calendar.

Definition 3.1: Given a set of time-stamped transactions, a temporal association
rule is a pair <AR, TF>, where AR is an implication of the form X=>Y and TF is a
temporal feature that AR possesses. It expresses that during each interval P in Φ(TF),
the presence of X in a transaction implies the presence of Y in the same transaction.
The frequency of temporal association rules is defined as follows:

• AR has confidence c % during interval P_i, P_i ∈ Φ(TF), if at least c % of
transactions in D(P_i) that contain X also contain Y.

• AR has support s % during interval P_i, P_i ∈ Φ(TF), if at least s % of transactions
in D(P_i) contain X ∪ Y.

• AR possesses the temporal feature TF with the frequency f % in the transaction
set D (i.e., the database D) if it has minimum confidence min_c % and
minimum support min_s % during at least f % of the intervals in Φ(TF).

Here, D(P_i) denotes the subset of D that contains all transactions with timestamps
belonging to P_i. In the above definition, the notion of frequency is introduced for
measuring the proportion of the intervals during which AR satisfies minimum
support and minimum confidence among the intervals in Φ(TF). It is required that the
frequency of any temporal association rule <AR, TF> should not be smaller than the
user-specified minimum frequency, which is a fraction within [0,1]. In the case that
Φ(TF) includes just a single interval, the frequency is omitted, since the only
meaningful minimum frequency is 1 and AR must have minimum support and
minimum confidence during this single interval. Temporal features can be represented
by calendar time expressions [4]. For example, the expression “years · months (6:8)”
represents “every summer”. Depending on the interpretation of the temporal feature
TF, a temporal association rule <AR, TF> can be referred to as:

• a universal association rule if Φ(TF) = { T }, where T represents the time
domain;

• an interval association rule if Φ(TF) = { itvl }, where itvl ⊂ T is a specific time
interval (e.g., a time period from July 11, 1998 to September 6, 1998);

• a periodic association rule if Φ(TF) = { p_1, p_2, ..., p_n }, where p_i ⊂ T is a periodic
interval in cycles (e.g., a weekend);

• a calendric association rule if Φ(TF) = { cal_1, cal_2, ..., cal_m }, where cal_i ⊂ T is a
calendric interval in a specific calendar (e.g., the first term in the 1998/1999 
academic year). 
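
The measures in Definition 3.1 can be sketched in Python as below. This is a minimal illustration under stated assumptions: transactions are (timestamp, itemset) pairs with numeric timestamps, intervals are inclusive (start, end) pairs, thresholds are fractions rather than percentages, and the function names are hypothetical.

```python
def support_confidence(itemsets, x, y):
    """Return (support, confidence) of X => Y over a list of item sets."""
    x, y = frozenset(x), frozenset(y)
    n_all = len(itemsets)
    n_x = sum(1 for items in itemsets if x <= items)
    n_xy = sum(1 for items in itemsets if (x | y) <= items)
    support = n_xy / n_all if n_all else 0.0
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


def frequency(db, intervals, x, y, min_s, min_c):
    """Fraction of intervals in which X => Y meets both thresholds.

    db: list of (timestamp, itemset); intervals: list of (start, end).
    """
    if not intervals:
        return 0.0
    satisfied = 0
    for start, end in intervals:
        # D(P_i): the transactions whose timestamps belong to the interval.
        d_p = [items for ts, items in db if start <= ts <= end]
        s, c = support_confidence(d_p, x, y)
        if s >= min_s and c >= min_c:
            satisfied += 1
    return satisfied / len(intervals)
```

A rule would then possess the temporal feature when the returned fraction is not smaller than the user-specified minimum frequency.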
3.2 Mining Tasks for Temporal Association Rules

The mining of temporal association rules has a two-dimensional solution space
consisting of rules and temporal features. From a practical point of view, it is too
expensive to find all hidden temporal association rules from large databases without
any given information. The methodology adopted in this paper is to simplify the
mining problem into restricted mining tasks, to seek different techniques for these
restricted tasks, and to use these techniques alternately in an interactive and iterative
mining process with the support of a data mining system. The restricted tasks are
identified below.

3.2.1 Discovery of Longest Intervals in Association Rules 

In many applications, people might just be interested in specific associations of some
items in the database, but have no idea about when and/or how often these
associations hold. These specific associations could be speculated by domain experts 
or may have been found in earlier efforts during the iterative data mining process. 
One temporal feature that is of interest is the valid time period of such associations. 
Given a time-stamped database and a known association, one of our interests is to find 
all possible time intervals over the time domain during which this association holds. 
Each of those time intervals will be composed of a totally ordered set of granules [2]. 
Here, a granule is a minimum interval of some fixed duration, by which the length of 
any time interval is measured. The granularity (the size of a granule) may vary with 
different applications and users’ interests. The interval granularity is normally greater 
than the granularity of instants with which the data are stamped. For example, in a 
supermarket database, a transaction may be stamped by “minute” (i.e., the instant 
granularity is “minute”), but an association may be looked for over the period of a day 
or several days (i.e., the interval granularity is “day”). More often than not people are 
just interested in the intervals, during which an association holds and the duration of 
which is long enough. In terms of different applications, a minimum length of the 
expected intervals can be specified by the users. We define the interestingness of time 
intervals as follows: 

Definition 3.2: A time interval over the time domain is interesting with respect to
an association rule (AR) if it is longer than the user-specified minimum length and AR
holds (satisfies the minimum support and confidence) during this interval and any
sub-interval of this interval that is also longer than the user-specified minimum
length.

Given a set of time-stamped transactions (D) over time domain (T) with an interval
granularity (GC), a minimum support (min_supp), a minimum confidence (min_conf),
and a minimum interval length (min_ilen), the task of mining valid time periods is to
find all the longest interesting intervals with respect to a given association AR.
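
One straightforward way to realise this task is sketched below in Python. It is not the search algorithm of [3], [4], only a brute-force illustration. Its assumptions: each granule is summarised by a hypothetical triple (number of transactions, number containing X, number containing X ∪ Y), an interval qualifies when it spans at least min_ilen granules, and interestingness is built up by length, using the fact that every proper sub-interval of [i, j] lies inside [i+1, j] or [i, j-1].

```python
from itertools import accumulate


def longest_interesting_intervals(granules, min_supp, min_conf, min_ilen):
    """Return the maximal interesting intervals as (first, last) granule indices.

    granules: list of (n_total, n_x, n_xy) counts, one triple per granule.
    """
    n = len(granules)
    # Prefix sums let us obtain the counts of any interval in O(1).
    tot = [0] + list(accumulate(g[0] for g in granules))
    cnt_x = [0] + list(accumulate(g[1] for g in granules))
    cnt_xy = [0] + list(accumulate(g[2] for g in granules))

    def holds(i, j):
        t = tot[j + 1] - tot[i]
        x = cnt_x[j + 1] - cnt_x[i]
        xy = cnt_xy[j + 1] - cnt_xy[i]
        return t > 0 and x > 0 and xy / t >= min_supp and xy / x >= min_conf

    interesting = set()
    for length in range(max(min_ilen, 1), n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            ok = holds(i, j)
            if ok and length > min_ilen:
                # All shorter qualifying sub-intervals must be interesting too.
                ok = (i + 1, j) in interesting and (i, j - 1) in interesting
            if ok:
                interesting.add((i, j))
    # Keep only intervals that cannot be extended by one granule on either side.
    return sorted((i, j) for (i, j) in interesting
                  if (i - 1, j) not in interesting
                  and (i, j + 1) not in interesting)
```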

3.2.2 Discovery of Longest Periodicities in Association Rules 

Another interesting temporal feature is a set of regular intervals in cycles, during
each of which an association exists. In this paper, a set of periodic intervals in cycles
is called a periodic time, which is composed of three essential features:
Cyclicity (a periodic time is always based on a fixed-length cycle, such as a month or a
week), Granularity (each periodic interval in cycles consists of a series of
granules of some fixed duration), and Interval Range (all the regular intervals in a
periodic time are located in the same position in their corresponding cycles).
Therefore, a periodic time can be represented by a triplet <Cyclicity, Granularity,
Range>, which is called an essential representation of the periodic time. For example,
<Year, Month, [10:12]>, <Week, Day, [2:6]> and <Month, Day, [1:10]> represent
“last quarter of every year”, “working days of each week”, and “the first ten days of
every month”, respectively.

Given a periodic time PT = <CY, GR, [x:y]>, its interpretation Φ(PT) = {P_1, P_2,
..., P_j, ...} is regarded as a set of periodic intervals consisting of the x-th to y-th
granules of GR, in all the cycles of CY. We define the interestingness of a periodic
time with respect to an association on the basis of the concept of interesting intervals.
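
The interpretation Φ(PT) can be sketched in Python for the simple case where every cycle contains the same number of granules (true for <Week, Day, ...>, but not for <Month, Day, ...>, which would need a real calendar). The function name and the use of absolute 1-based granule numbers are assumptions for illustration.

```python
def periodic_intervals(granules_per_cycle, x, y, n_cycles):
    """Return the x-th to y-th granules of each cycle as (first, last) pairs.

    Granules are numbered 1, 2, 3, ... across consecutive cycles.
    """
    if not 1 <= x <= y <= granules_per_cycle:
        raise ValueError("range must satisfy 1 <= x <= y <= granules_per_cycle")
    return [(cycle * granules_per_cycle + x, cycle * granules_per_cycle + y)
            for cycle in range(n_cycles)]
```

For <Week, Day, [2:6]> over three weeks, the call `periodic_intervals(7, 2, 6, 3)` gives the day ranges 2 to 6, 9 to 13 and 16 to 20.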

Definition 3.3: A periodic time PT = <CY, GR, RR> is interesting with respect to
an association rule (AR) if the ratio of its periodic intervals in cycles that are
interesting with respect to AR is not less than a user-specified minimum frequency.
PT is a longest interesting periodic time with respect to AR if: 1) PT is interesting
with respect to AR, and 2) there is no PT' = <CY, GR, RR'> such that RR' ⊃ RR and
PT' is interesting with respect to AR.

Given a set of time-stamped transactions (D) over a time domain (T), a minimum
support (min_supp), a minimum confidence (min_conf), a minimum frequency
(min_freq), a minimum interval length (min_ilen), as well as the cycle of interest (CY)
and granularity (GR), the task of mining the periodicities of an association (AR) is to
find all possible periodic times <CY, GR, RR> which are longest with respect to the
association AR. Here, the range RR is expected to be discovered.

3.2.3 Discovery of Association Rules with Temporal Features 

The mining of associations with specific temporal features that reflect the crucial
time periods in the business might also be of interest to some applications. In this
case, only the behaviour over those periods is crucial. Those interesting periods are
characterised by temporal features, such as a single time period (e.g., from July 1996
to January 1998), a periodic time (e.g., every Friday), or a specific calendar (e.g.,
payment days in 1998). Given a set D of time-stamped transactions over a time
domain T and user-specified thresholds (minimum support min_s %, minimum
confidence min_c % and minimum frequency min_f %), the task of mining association
rules with a given temporal feature TF is to discover all possible associations of the
form X=>Y in D that possess the temporal feature TF with the minimum frequency
min_f %.



4. An Integrated Query and Mining System 

The search techniques and algorithms for the above mining tasks for temporal 
association rules have been presented in ([3], [4]) and implemented in an integrated 
query and mining system (IQMS) presented here. The IQMS system supplies mining 
users with an interactive query and mining interface (IQMI) and an SQL-like 
Temporal Mining Language (TML). 

To support the iterative process of the discovery of temporal association rules, the
IQMS system has been integrated with the functions of both data query and data
mining. By using this mining system, people can make either an ad hoc query for
interesting data or an ad hoc mining for interesting knowledge, with an interactive
query and mining interface (IQMI) (see Figure 4.1). This interface connects a user to
any Oracle database and prompts the user to submit any SQL or TML statement. If it
is a query, the query result will be displayed. If it is a mining task, the mined rules are
displayed. The system has been implemented with Oracle Pro*C/C++ under a Sun
SPARCstation 20 running SunOS 5.5.

An SQL-like data mining language [6] is one of the important tools to support the 
interactive mining process. As the kernel of the IQMS system, the TML language 
supplies users with the ability to express various mining tasks for temporal 
association rules discussed in section 3.2. The structure of TML is described as 
follows: 



Figure 4.1 Interactive Query and Mining Interface (the user submits either an SQL
statement or a TML statement)



1) Mine Association_Rule ( <rule-specification> ) 

2) [ With Periodicity ( <periodic-specification> ) ] 

3) [ Within Interval ( <interval-specification> ) ] 

4) [ Having Thresholds <threshold-specification> ] 

5) In <query-block> 

A mining task in TML consists of the mining target part (lines 1-4) and the data
query block. The keyword Association_Rule is a descriptor, followed by the rule
specification. The rule specification is either the keyword “ALL” or a specific rule. The
former indicates that all potential rules should be found from the database and the
latter expresses that the temporal features of this rule are to be extracted from the
database. The periodicity of rules is specified by the periodic specification, which is
either an instantiated periodic expression (e.g., Month*Day(1:5)) or an uninstantiated
one (e.g., Week*Day(ALL)). The former shows that only rules with this periodicity
are expected, while the latter expresses that all possible periodicities of a specific rule
are expected. The interval specification can be used for describing the valid period of
rules, where the interval specification is either an instantiated interval expression
(e.g., After Year(1998)) or an uninstantiated one (e.g., Day(ALL)). The former
indicates the specific time period that users are interested in, while the latter expresses
that all contiguous time intervals during which rules (periodic or non-periodic) may
exist are to be extracted. Thresholds relevant to different tasks can be stated in the
Having-Threshold clause. The data relevant to the data mining task can be stated in
the query block. Consider a Sales database containing two relations Items and
Purchase as shown in Figure 4.2. Three examples of mining tasks in TML are given
below.

Example 4.1: Mining Longest Intervals of an Association Rule

Mine Association_Rule (“Coca Cola” ⇒ “Tesco Brown Bread”)
Within Interval ( day(ALL) )
Having Thresholds support = 0.3, confidence = 0.75, interval_len = 6
In Select trans_no: Tid, Set(item_name): ItemSet, trans_time: TimeStamp
   From purchase, items
   Where purchase.item_no = items.item_no
   Group By purchase.trans_no ;

This example describes a task of mining all longest intervals of an association rule
with the thresholds of support and confidence being 0.3 and 0.75, respectively. The
interval granularity is specified as ‘day’. The threshold expression “interval_len = 6”
states that all discovered intervals must be at least 6 days long. In this
example, the “Select-From-Where-Group” part expresses a basic data query demand.
The result of the data query is a nested relation, forming the data set relevant to this
mining task. “trans_no” and “Set(item_name)” are considered as “Tid” and “ItemSet”
in the mining model for temporal association rules, respectively, and “trans_time” in
the database is chosen as the time stamp.
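
The effect of the query block, which joins the two relations and nests the item names per transaction, can be mimicked in Python as follows. This is only a sketch of the resulting data shape, not of how IQMS evaluates the query; the function name and the tuple layouts are assumptions.

```python
def nest_transactions(purchase_rows, item_names):
    """Build (Tid, ItemSet, TimeStamp) triples from flat purchase rows.

    purchase_rows: iterable of (trans_no, trans_time, item_no).
    item_names: dict mapping item_no -> item_name (the Items relation).
    """
    grouped = {}
    for trans_no, trans_time, item_no in purchase_rows:
        if item_no not in item_names:
            continue  # inner join: rows without a matching item are dropped
        _, items = grouped.setdefault(trans_no, (trans_time, set()))
        items.add(item_names[item_no])
    return [(tid, frozenset(items), ts)
            for tid, (ts, items) in sorted(grouped.items())]
```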

Items
| item-no | item-name | brand | category | retail-price |

Purchase
| trans-no | trans-time | customer | item-no | qty | amount |

Figure 4.2 Relation Schemes in Sales Database

Example 4.2: Mining Periodicities of an Association Rule

Mine Association_Rule (“Coca Cola”, “Flora Butter” ⇒ “Tesco White Bread”)
With Periodicity ( week*day(ALL) )
Within Interval ( starts_from year(1990) and finishes_by year(1998) )
Having Thresholds support = 0.3, confidence = 0.75, frequency = 0.8,
   interval_len = 2
In Select trans_no: Tid, Set(item_name): ItemSet, trans_time: TimeStamp
   From purchase, items
   Where purchase.item_no = items.item_no
   Group By purchase.trans_no ;

The mining task in this example is to look for all the periodicities of a specific
association rule between 1990 and 1998, with the cyclicity being ‘week’ and the
granularity being ‘day’. The minimum interval length is two days. The thresholds of
support, confidence and frequency are 0.3, 0.75, and 0.8, respectively.

Example 4.3: Mining Association Rules With a Periodicity

Mine Association_Rule (ALL)
With Periodicity ( Years·Months(6:8) )
Within Interval ( starts_from Years(1990)·Months(7) )
Having Thresholds support = 0.6, confidence = 0.75, frequency = 0.8
In Select trans-no: TID, Set(item-name): ItemSet, trans-time: TimeStamp
   From purchase, items
   Where purchase.item-no = items.item-no and items.retail-price >= 10
   Group By purchase.trans-no Having Count(*) > 3

The above is an example of a mining task for finding all periodic association rules
that convey purchase patterns every summer (assuming that summer starts from the
seventh month of each year) since July 1990, with the thresholds of support,
confidence and frequency being 0.6, 0.75 and 0.8, respectively. In this example, we
are only concerned with items whose retail prices are not less than £10 and
transactions in which the number of purchased items is greater than 3.

By using this system, data miners can explore knowledge from databases in an
iterative process as shown in Figure 4.3. With the query function, the data in the
database can first be analysed, according to the business requirements, in order to
get some useful information (e.g., summary information about the data) for designing
mining tasks. Then, the mining task, expressed in the TML language, can be designed
by the data miner and fulfilled by the ad hoc mining function of the system. The
mining results need to be further analysed to judge whether the expected knowledge has
been found or whether the mining task should be adjusted to make another mining effort.
The adjustment could be, for example, a change of the threshold values in the
mining target part, or a modification of the restriction on selected data in the data
query part.

A submitted TML statement (a mining task) will be processed in the following 
steps: 

Task Analysis: The statement is syntactically and semantically analysed. The 
outcomes of this step are: i) a query demand is generated for preparing the data 
relevant to the mining task and is represented in an SQL-like intermediate language, 
and ii) an internal mining problem representation consisting of mining target (e.g., 
rules, longest intervals, or periodicities) and thresholds (i.e., minimum support, 
confidence, frequency and interval length) is generated. The generated query and 
extracted mining target representation will be used in the next two steps successively. 

Data Preparation: The data relevant to the mining task are prepared (according to 
the query demand generated) in this step. These data are selected from the database 
and are organised in the format required by the mining algorithms. To reduce the 
redundant data access to operational databases, an internal data cache is used. The 
data cache consists of a fixed number of temporary tables, storing selected data sets 
relevant to the most recently performed mining tasks. The corresponding information 
about each data set in the cache is organised by a cache manager. 

Rule Search: The search for possible rules from the prepared data is performed in 
this step, according to the mining target and thresholds. In terms of the mining target, 
the system will execute the appropriate search algorithm. These algorithms have been 
presented in [3], [4]. 
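
As a rough illustration of the data cache described in the Data Preparation step, the sketch below keeps a fixed number of prepared data sets keyed by their query text. The class name `DataSetCache`, the `load` callback and the least-recently-used eviction rule are assumptions made for this example; the text only states that the cache holds a fixed number of temporary tables for the most recently performed mining tasks.

```python
from collections import OrderedDict


class DataSetCache:
    """Hypothetical fixed-size cache of prepared data sets, keyed by query text."""

    def __init__(self, capacity):
        self.capacity = capacity          # fixed number of cached data sets
        self._tables = OrderedDict()      # query text -> prepared rows

    def get(self, query, load):
        """Return the rows for `query`, calling `load(query)` only on a miss."""
        if query in self._tables:
            # Cache hit: mark this data set as the most recently used one.
            self._tables.move_to_end(query)
            return self._tables[query]
        rows = load(query)                # cache miss: access the database
        self._tables[query] = rows
        if len(self._tables) > self.capacity:
            # Assumed policy: drop the least recently used data set.
            self._tables.popitem(last=False)
        return rows
```

With a capacity of two, asking for the data of a third, different query evicts whichever of the first two was used least recently, so repeated refinements of the same mining task do not touch the operational database again.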

Current commercial DBMS products support only a few limited operations on date 
and time types. Some specific time mechanisms are required for the processing of 
temporal aspects in the mining tasks. In this system, a number of time-related 
functions have been implemented. 




Figure 4.3 IQMI-Based Mining Process 

5. Experimental Results 

Synthetic data have been widely used for the evaluation of the mining of 
association rules ([5], [7]). Three datasets that mimic the transactions within one year 
in a retailing application have been generated and used for evaluating the approach 
presented in this paper. Each transaction in the datasets is stamped with the time 
instant at which it occurs. We assume that the numbers of transactions in different 
granules (each day) within a cyclic period (each month) are binomially distributed 
and the number of transactions occurring during each cyclic period is picked from a 
Poisson distribution with a given mean value. The average size of transactions is set 
as 10 and the average number of transactions during each granule (a day) is set as 
100. All the datasets are stored in the Oracle 7.3.2 system. The performance analysis 
of the algorithms implemented in the system is beyond the scope of this paper. Here, the 
discussion will only focus on the number of discovered rules. Figure 5.1 depicts the 
number of discovered universal rules by not considering time issues, and the number 
of discovered temporal rules with the periodicity <Month, Day, [13:17]>, which 
satisfy the minimum support (0.05) and confidence (0.6) and different minimum 
frequencies from 0.1 to 1.0. These results show that a number of rules that may occur 
during a series of time periods of interest could be missed if only the approach for 
mining universal association rules is considered. For example, only 17 universal rules 
can be discovered from dataset 2 if temporal features are not taken into account. 
However, there are 39 rules that satisfy the support and confidence thresholds during 
the time periods between the 13th day and the 17th day within at least three months 
(see the column marked 0.3 in Figure 5.1). If the users are more interested in the 
rules occurring during this specific period (between 13th and 17th), the approach for 
mining temporal association rules should obviously be adopted. Moreover, the 
techniques for mining temporal features of rules can further be used if the actual time 
periods during which the rules occur are expected to be identified by data miners. In 
addition, it is also shown in Figure 5.1 that the value of minimum frequency is one of 
the most important factors that decide the number of discovered rules. It is important 
to choose an appropriate threshold according to application requirements, in order to 
discover all potentially useful rules. Consider dataset 1, for example: when the 
minimum frequency is set to be greater than 0.6, few potential rules are discovered, 
while if the minimum frequency is too small (say, 0.1), the number of potential 
rules discovered will be very large (see the first column on the left in Figure 5.1). 

6. Conclusions 

This paper concentrates on the temporal aspects of association rules. 
Three forms of mining tasks for temporal association rules were addressed. By 
alternatively fulfilling these mining tasks, temporal association rules can be extracted 
from databases in an iterative and interactive process. This process is supported by the 
integrated query and mining system with an interactive query and mining interface 
(IQMI) and an SQL-like temporal mining language (TML). This paper introduced the 
IQMI interface, described the TML language, presented the IQMI-based mining 
process, and discussed the implementation issues. The results of experiments with 
synthetic datasets show that many time-related association rules that would have been 
missed with traditional approaches can be discovered with the approach presented in 
this paper. The significance of the integrated system has been shown in three aspects 
in the dynamic data mining efforts. Firstly, the data selection and sampling for 
different data mining tasks are easy to achieve with the query function that is 
integrated in the system. Secondly, ad hoc mining for different application 
requirements can be fulfilled with the mining language supported by the system. 
Finally, all data mining activities can be undertaken in a more flexible data mining 
process based on IQMI. 

Abstract. Much of the data mining research has been focused on devising 
techniques to build accurate models and to discover rules from databases. 
Relatively little attention has been paid to mining changes in databases collected 
over time. For businesses, knowing what is changing and how it has changed is 
of crucial importance because it allows businesses to provide the right products 
and services to suit the changing market needs. If undesirable changes are 
detected, remedial measures need to be implemented to stop or to delay such 
changes. In many applications, mining for changes can be more important than 
producing accurate models for prediction. A model, no matter how accurate, 
can only predict based on patterns mined in the old data. That is, a model 
requires a stable environment, otherwise it will cease to be accurate. However, 
in many business situations, constant human intervention (i.e., actions) in the 
environment is a fact of life. In such an environment, building a predictive 
model is of limited use. Change mining becomes important for understanding 
the behaviors of customers. In this paper, we study change mining in the 
context of decision tree classification for real-life applications. 



1. Introduction 

The world around us changes constantly. Knowing and adapting to changes is an 
important aspect of our lives. For businesses, knowing what is changing and how it 
has changed is also crucial. There are two main objectives for mining changes in a 
business environment: 

1. To follow the trends: The key characteristic of this type of application is the word 
"follow". Companies want to know where the trend is going and do not want to be 
left behind. They need to analyze customers' changing behaviors in order to 
provide products and services that suit the changing needs of the customers. 

2. To stop or to delay undesirable changes: In this type of application, the keyword 
is "stop". Companies want to know about undesirable changes as early as possible and to 
design remedial measures to stop or to delay the pace of such changes. For 
example, in a shop, people used to buy tea and creamer together. Now they still 
buy tea, but seldom buy creamer. The shopkeeper needs to know this information 
so that he/she can find out the reason and design some measures to attract 
customers to buy creamer again. 

In many applications, mining for changes can be more important than producing 
accurate models for prediction, which has been the focus of existing data mining 
research. A model, no matter how accurate, in itself is passive because it can only 
predict based on patterns mined in the old data. It should not lead to actions that may 
change the environment because otherwise the model will cease to be accurate. 
Building models for prediction is more suitable in domains where the environment is 
relatively stable and there is little human intervention (i.e., nature is allowed to take 
its course). However, in many business situations, constant human intervention to the 
environment is a fact of life. Companies simply cannot allow nature to take its course. 
They constantly need to perform actions in order to provide better services and 
products. For example, in a supermarket, there are always discounts and promotions 
to raise sales volume, to clear old stock and to generate more sales traffic. Change 
mining is important in such situations because it allows the supermarket to compare 
results before and after promotions to see whether the promotions are effective, and to 
find interesting changes and stable patterns in customer behaviors. 

Even in a relatively stable environment, changes (although at a slower pace) are 
also inevitable due to internal and external factors. Significant changes often require 
immediate attention and actions to modify the existing practices and/or to alter the 
domain environment. Let us see a real-life example. 

A company hired a data mining consultant firm to build a classification model 
from their data. The model was built using the decision tree engine in a commercial 
data mining system. The accuracy was 81% at the time when the model was built. 
However, after the Asian financial crisis, the model only worked 60% of the time. The 
company asked the consultant why the classification model did not work any more. 
The reason, of course, is simple: the training data used to build the model (or 
classifier) was collected before the financial crisis. The consulting firm then built a 
new model (a classifier) for the company using the data collected after the financial 
crisis. The model was again accurate. However, after a while the company realized 
that the new model did not help. The reason is also simple. The company's profit was 
dropping and an accurate model could not stop this decline. What they really needed was 
to know what had changed in customer behavior after the financial crisis so that 
they could take action to reverse the situation. This requires change mining to 
compare the data collected from different periods of time. 

In this paper, we study change mining in the context of decision tree 
classification. The study is motivated by two real-life data mining applications (see 
Section 3.2). In these applications, the users want to find changes in their databases 
collected over a number of years. In our literature search, we could not find suitable 
techniques to solve the problems. We thus designed a method for change mining in 
decision tree classification. This method has been incorporated into a decision tree 
algorithm to make it also suitable for mining changes. 

There is existing work on learning and mining in a changing 
environment. Existing research in machine learning and computational learning 
theory has focused on generating accurate predictors in a drifting environment 
[e.g., 14, 5, 7, 17]. It does not produce the explicit changes that have occurred. In data 
mining, [1, 4] addressed the problem of monitoring the support and confidence 
changes of association rules. [6] gave a theoretical framework for measuring changes. 
We will discuss these and the other related works in Section 4. 



2. Mining Changes in the Decision Tree Model 

Decision tree construction is one of the important model building techniques. Given a 
data set with a fixed discrete class attribute, the algorithm constructs a classifier of the 
domain that can be used to predict the classes of new (or unseen) data. 

Traditionally, misclassification error rate is used as the indicator to show that the 
new data no longer conforms to the old model. However, the error rate difference 
does not give the characteristic descriptions of changes, as we will see below. 
Additional techniques are needed. 

In a decision tree, each path from the root node to a leaf node represents a hyper- 
rectangle region. A decision tree essentially partitions the data space into different 
class regions. Changes in the decision tree model thus mean changes in the partition 
and changes in the error rate (see Section 2.3). Our objective in change mining is: 

• to discover the changes in the new data with respect to the old data and the old 
decision tree, and present the user with the exact changes that have occurred. 

The discovered changes should also be easily understood by the user. Our application 
experiences show that changes are easily understood if they are closely related to the 
old decision tree or the old partition. 

2.1 Approaches to change mining 

Below, we present three basic approaches to change mining in the decision tree 
model: new decision tree, same attribute and best cut, and same attribute and same 
cut. The first two approaches modify the original tree structure, which makes the 
comparison with the original structure difficult. The third approach is more 
appropriate and it is the method that we use. Here, we discuss these three approaches. 
The detailed algorithm for the third approach is presented in Section 2.2. Note that the 
basic decision tree engine we use in our study is based on that in C4.5 [15]. We have 
modified it in various places for change mining purposes. 

1. New decision tree: In this method, we generate a new decision tree using the new 
data, and then overlay the new decision tree on the old decision tree and compare 
the intersections of regions. The intersection regions that have conflicting class 
labels are the changes. This idea was suggested in [6]. 

2. Same attribute and best cut: This method modifies the decision tree algorithm so 
that in generating the new tree with the new data, it uses the same attribute as in the 
old tree at each step of partitioning, but it does not have to choose the same cut 
point for the attribute as in the old tree. If the algorithm has not reached the leaf 
node of a particular branch in the old tree and the data cases arriving here are already 
pure (with only one class), it stops, i.e., no further cut is needed. If any branch of 
the new tree needs to go beyond the depth of the corresponding branch in the old 
tree, the normal decision tree building process is performed after the depth. 

3. Same attribute and same cut: In this method, we modify the decision tree engine 
so that in building the new tree, it not only uses the same attribute but also the 
same cut point in the old tree. If the algorithm has not reached the leaf node of a 
particular branch in the old tree and the data cases arriving here are already pure, it 
stops. If any branch of the new tree needs to go beyond the depth of the 
corresponding branch in the old tree, the normal process is performed. 
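
The core of the third approach, reusing both the attribute and the cut point of each old node and stopping early on pure data, can be sketched as follows. This is only a minimal illustration under stated assumptions: the nested-dictionary tree format, the function name `follow_old_tree` and the use of attribute indices are hypothetical, and the sketch deliberately leaves out growth beyond the old leaves and pruning.

```python
from collections import Counter


def follow_old_tree(node, rows):
    """Route new rows through the old tree, keeping its attributes and cuts.

    `node` is an old-tree node: either {"label": ...} for a leaf or
    {"attr": index, "cut": value, "left": ..., "right": ...} (assumed format).
    `rows` is a list of (feature_tuple, class_label) pairs from the new data.
    """
    labels = Counter(label for _, label in rows)
    # Stop at an old leaf, or stop early when the arriving cases are pure
    # (or when no case arrives at all).
    if "attr" not in node or len(labels) <= 1:
        majority = labels.most_common(1)[0][0] if labels else None
        return {"label": majority, "n": len(rows)}
    attr, cut = node["attr"], node["cut"]
    left = [(x, y) for x, y in rows if x[attr] <= cut]
    right = [(x, y) for x, y in rows if x[attr] > cut]
    return {
        "attr": attr,
        "cut": cut,                      # same cut point as in the old tree
        "left": follow_old_tree(node["left"], left),
        "right": follow_old_tree(node["right"], right),
    }
```

Because the new tree shares every cut with the old one, a branch that collapses into a single leaf directly signals that the old cut below it is no longer needed for the new data.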

In all three approaches, decision tree pruning is also performed. Let us use an 
example to show the differences among the three approaches. We use the Iris data set 
from the UC Irvine machine learning repository [12] in the example. The Iris data has 4 
attributes and 3 classes. We only use two attributes (and all 3 classes) here for 
illustration. The data points in the original data set are drawn in Figure 1 together with 
the partition produced by the decision tree engine. Next, we introduce changes in the 
data by shifting some data points of the setosa class in region 1 (Figure 1) toward the left and 
putting some versicolor class points in the space vacated; see the shaded area in Figure 2. 




Fig. 1. Partition produced by decision tree on the original data (legend: setosa, versicolor, virginica) 

Fig. 2. The introduced change (horizontal axis: sepal_length) 



We look at approach 1 first. Figure 3 shows the partition produced by approach 1 on 
the new data after the changes have been introduced. From Figure 3 alone, it is not 
clear what the changes are. 




Fig. 3. Partition produced by decision tree on the new data (legend: setosa, versicolor, virginica) 

Figure 4 shows the overlay of this partition (Figure 3) on the old partition (Figure 1). 
Dashed lines represent the old partition (produced from the old data). The shaded 
areas are the conflicting regions, which represent changes. 




Fig. 4. Overlay of the new partition on the old partition (legend: setosa, versicolor, virginica; horizontal axis: sepal_length) 



Clearly, the result is not satisfactory. It produces changes that do not actually exist, 
i.e., the intersection areas marked 2, 3, and 4. These problems become more acute 
when the number of attributes involved is large. The reason is that the changes in data 
points caused the decision tree engine to produce a completely new partition that has 
nothing to do with the old partition (it is well known that the decision tree algorithm 
can result in a very different tree even if only a few data points are moved slightly). 
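
To make the overlay idea of approach 1 concrete, the following sketch intersects the hyper-rectangles of an old and a new partition and reports the intersections whose class labels disagree. The region representation (a list of per-attribute `(low, high)` bounds plus a label) and both function names are assumptions chosen for this illustration; they are not taken from [6] or from the system described here.

```python
def intersect(box_a, box_b):
    """Intersect two axis-parallel boxes given as lists of (low, high) bounds."""
    box = [(max(lo_a, lo_b), min(hi_a, hi_b))
           for (lo_a, hi_a), (lo_b, hi_b) in zip(box_a, box_b)]
    # An empty or degenerate overlap on any attribute means no intersection.
    return box if all(lo < hi for lo, hi in box) else None


def conflicting_regions(old_regions, new_regions):
    """Return (box, old_label, new_label) for overlaps with different labels."""
    conflicts = []
    for old_box, old_label in old_regions:
        for new_box, new_label in new_regions:
            box = intersect(old_box, new_box)
            if box is not None and old_label != new_label:
                conflicts.append((box, old_label, new_label))
    return conflicts
```

Every pair of differently labelled regions that overlap is reported, whether or not any data actually moved there, which is why a freshly grown tree with unrelated cuts tends to produce spurious conflict regions.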

The same problem exists for the second approach, same attribute and best cut. 
Figure 5 shows the new partition. Although the algorithm tries to follow the attribute 
sequence in the old tree, the cut points can change drastically, which makes it hard 
to compare with the old tree. Again we need to use overlay to find changes. Hence, it 
has the same problem as the first approach. 

The third approach (same attribute and same cut), on the other hand, does not have 
these problems. Figure 6 shows the partition obtained. The shaded area represents the 
change, which is precisely what has been introduced (see Figure 2). The change is 
also closely related to the old tree, and thus easily understood. We present the details 
of the third approach below, which is what we use in our system. 




Fig. 5. Partition produced by the second approach (legend: setosa, versicolor, virginica) 

Fig. 6. Partition produced by the third approach (legend: setosa, versicolor, virginica) 



2.2 Change mining for the third approach: same attribute and same cut 

The conceptual algorithm for the third approach is given in Figure 7. Let $O_T$ be the old 
tree, and $N_D$ be the new data. Let the decision tree building algorithm be buildTree(). 

Algorithm mineChange($O_T$, $N_D$) 

1 Force buildTree($N_D$) to follow the old tree $O_T$ (both the attributes and cut points) and 
stop earlier if possible; 

2 Test for significance of the error rate change at each old leaf node and show the user 
those leaves that have significant error rate changes; 

3 Grow the new tree further (not only those branches that have changed significantly); 

4 Prune the tree to remove those non-predictive branches; 

5 Traverse the new tree and compare it with $O_T$ to identify changes; 



Fig. 7. The proposed algorithm 



Five points to note about the algorithm: 

a. In line 1, the algorithm basically tries to follow the old tree. 

b. Line 2 uses chi-square test [13] to test the significance of the changes in error rates. 

c. In line 3, we allow the decision tree algorithm to grow the new tree further, not 
only those branches that have significant changes. The reason is that those 
branches that do not result in significant changes may be further partitioned to 
result in more homogeneous regions, which was not possible in the old data. 



Fig. 8. Partitioning the region further: (a) partition from the old data; (b) partition from the new data 

For example, in the old data in Figure 8(a), the shaded region cannot be 
partitioned further because the error points (A) are randomly positioned. However, 
in the new data, the error rate for the shaded region remains the same as in the old 
partition, but it can now be further refined into two pure regions (Figure 8(b)). 

d. In line 4, the new tree is subjected to pruning. Pruning is necessary because the 
new tree can grow much further and result in overfitting the new data. As in the 
normal decision tree building, we need to perform pruning to remove those non- 
predictive branches and sub-trees. There are two ways to perform pruning in 
normal decision tree building [15]. They are: 

1) discarding one or more sub-trees and replacing them with leaves; 

2) replacing a sub-tree by one of its branches. 

For change mining, the pruning process needs to be restricted. We still use (1) as 
it is because it only results in join of certain regions, which is understandable. 
However, (2) needs to be modified. It should only prune till the leaf nodes of the 
old tree. Those nodes above will not be subjected to the second type of pruning 
because otherwise it can result in drastic modification in the structure of the old 
tree, thus making the changes hard to identify and to understand. 

e. In line 5, the algorithm traverses the new tree and compares it with the 
corresponding nodes in the old tree to report the changes to the user. Below, we 
present the types of changes that we aim to find using the proposed approach. 
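
The significance test mentioned in point b can be illustrated with a Pearson chi-square statistic on a 2x2 table of correctly and wrongly classified cases at one old leaf, for the old and the new data. The sketch below is an assumption-laden illustration: the function names are hypothetical, no continuity correction is applied, and the 5% significance level (critical value 3.841 for one degree of freedom) is chosen here because the text does not state which level the system uses.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0                       # degenerate table: nothing to compare
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Critical value of the chi-square distribution, 1 degree of freedom, 5% level
# (the level itself is an assumption of this sketch).
CRITICAL_95_DF1 = 3.841


def error_rate_changed(old_correct, old_wrong, new_correct, new_wrong):
    """True if the error rates at a leaf differ significantly between data sets."""
    stat = chi_square_2x2(old_correct, old_wrong, new_correct, new_wrong)
    return stat > CRITICAL_95_DF1
```

For instance, a leaf with 10 errors out of 100 old cases and 40 errors out of 100 new cases gives a statistic of 24, well above the critical value, whereas 10 versus 12 errors out of 100 would not be flagged.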

2.3 Identify different types of changes 

There are many kinds of changes that can occur in the new data with respect to the old 
data. Below, we identify three main categories of changes in the context of decision 
tree building: partition change, error rate change, and coverage change. The first 
four types of changes presented below are partition changes, type 5 is the error rate 
change and type 6 is the coverage change. Their meanings will be made clear later. 

Type 1. Join of regions: This indicates that some cuts in the old tree (or partition) 
are no longer necessary because the data points in the new data set arriving in the split 
regions are now homogeneous (of the same class) and need no further partitioning. 

Type 2. Boundary shift: This indicates that a cut in the old tree is shifted to a new 
position. It only applies to numeric attributes. Boundary shifts only happen at those 
nodes right above the leaf nodes of the old tree. It is not desirable to allow boundary 
shifts in the earlier nodes because otherwise, the whole tree can be drastically 
changed resulting in the problem discussed in Section 2.1. 

Type 3. Further refinement: This indicates that a leaf node in the old tree can no 
longer describe the new data cases arriving at the node (or the region represented by the 
node). Further cuts are needed to refine the node. 

Type 4. Change of class label: This indicates that the original class of a leaf node 
in the old tree has changed to a new class in the new tree. For example, a group of 
people who used to buy product-1 now tend to buy product-2. 

Partition changes basically isolate the changes to regions (which can be expressed 
as rules). They provide the detailed characteristic descriptions of the changes. They 
are very useful for targeted actions. Though useful, partition changes may not be 
sufficient. Error rate change and coverage change are also necessary. The reasons are 
two-fold. First, sometimes we cannot produce a partition change because the changes 
of the data points are quite random, i.e., we cannot isolate the changes to some 
regions. Nevertheless, there is change, e.g., the error rate has increased or decreased, 
or the proportion of data points arriving at a node has increased or decreased. Second, 
even if we can characterize the changes with regions, the data points in the regions 
may not be pure (i.e., with different classes of points). In such cases, we need error 
rate change to provide further details to gauge the degree of change. 

Type 5. Error rate change (and/or class distribution change): This indicates that 
the error rate of (and/or the class distribution of the data points arriving at) a node in the 
new tree is significantly different from the error rate of the same node in the old tree. 
For example, the error rate of a node in the old tree is 10%, but in the corresponding 
node of the new tree it is 40%. 

Type 6. Coverage change: This indicates that the proportion of data points arriving 
at a node has increased or decreased significantly. For example, in the old tree, a node 
covers 10% of the old data points, but now it only covers 2% of the new data points. 

The partition changes (the first four types) can be easily found by traversing the 
new tree and comparing it with the old tree. The information on the last two types of 
changes is also easily obtainable from the final new tree and the old tree. 
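
A simultaneous traversal of the two trees, as described above, might look like the following sketch. It is a simplified illustration only: the nested-dictionary tree format and the function name `find_partition_changes` are assumptions, the new tree is assumed to share the old tree's attributes and cuts wherever both have an internal node, and boundary shifts (Type 2) are not handled.

```python
def find_partition_changes(old, new, path=()):
    """Compare two trees node by node and list the partition changes found.

    A leaf is {"label": ...}; an internal node additionally has
    "attr", "cut", "left" and "right" (assumed format).
    Returns a list of (path, change_type) pairs.
    """
    old_is_leaf = "attr" not in old
    new_is_leaf = "attr" not in new
    if old_is_leaf and new_is_leaf:
        if old["label"] != new["label"]:
            return [(path, "change of class label")]      # Type 4
        return []
    if new_is_leaf:
        return [(path, "join of regions")]                # Type 1
    if old_is_leaf:
        return [(path, "further refinement")]             # Type 3
    changes = []
    for side in ("left", "right"):
        changes += find_partition_changes(old[side], new[side], path + (side,))
    return changes
```

Each reported path identifies a region of the old partition, so the output can be turned into rules of the form "in this old region, the following has changed".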

3. Experiments and Applications 

We evaluate the proposed techniques using synthetic data sets and real-life data. The 
goal is to assess the effectiveness of the proposed technique (the third approach in 
Section 2.1). Efficiency is not an issue here because it uses an existing technique for 
decision tree building [15], which has been proven very efficient. 

3.1 Synthetic data test 

We implemented a synthetic data generator to produce data sets of 2 classes. It takes 
as input, the number of attributes (all attributes are numeric attributes), range of 
values for each attribute, the number of data regions with attached classes, the 
locations of the data regions to be generated, and the number of data points in each 
region. Basically, the data generator generates a number of regions with data points in 
them, and each data point is also labeled with a class. Each region is in the shape of a 
hyper-rectangle (each surface of the hyper-rectangle is parallel to one axis and 
orthogonal to all the others). The data points in each region are randomly generated 
using a uniform distribution. In the new data generation, we introduce changes by 
modifying the input parameters used for the old data set generation. 
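
A generator of this kind can be sketched in a few lines. The function names, the `(bounds, label, n_points)` region description and the fixed random seed are assumptions made for the example; the original generator's interface is not given in the text.

```python
import random


def generate_region(bounds, label, n_points, rng):
    """Draw n_points uniformly inside an axis-parallel hyper-rectangle.

    `bounds` is a list of (low, high) pairs, one per numeric attribute.
    """
    return [(tuple(rng.uniform(lo, hi) for lo, hi in bounds), label)
            for _ in range(n_points)]


def generate_data(regions, seed=0):
    """Build a labelled data set from (bounds, label, n_points) descriptions."""
    rng = random.Random(seed)            # seeded so that runs are repeatable
    data = []
    for bounds, label, n_points in regions:
        data.extend(generate_region(bounds, label, n_points, rng))
    return data
```

A "new" data set with a planted change is then obtained by calling `generate_data` again with modified parameters, for example a shifted bound, a different label for one region, or a different number of points.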

In our experiments (the results are summarized in Table 1), we used data sets with 
3, 5, 8 and 10 dimensions. We first generate 3 old data sets (i.e., No. of Expts in Table 
1) with 6, 8 and 10 data regions for each of the 3, 5, 8 and 10 dimensional spaces. For 
each old data set, we then generate one new data set and introduce all 6 types of 
changes at different locations. We then run the system to see if the planted changes 
can be identified. Experiment results show that all the changes embedded are found. 



Table 1. Synthetic data test 

No. of Expts | No. of dimensions | Types of changes introduced | Changes found 

3.2 Real-life Data Tests 

We now describe our two real-life data tests. Due to confidentiality agreements, we 
cannot disclose the detailed findings. The first application is for an educational 
institution that wishes to find out more about its students. We were given the data 
from the past few years. The data includes the student's examination results, family 
background, personal particulars etc. Our user was interested in knowing how the 
performances of different groups of students had changed over the years. This 
requires change mining. A base year was chosen to build the old tree and the program 
was run by comparing the subsequent years with the base year decision tree. Some 
interesting trends immediately became apparent using our technique. For instance, we 
were able to tell our user that the performance of a certain group (with some 
characteristics) of students in a particular subject had steadily deteriorated over the 
years. In addition, we also discovered that in a particular year, a group of students 
suddenly outperformed another group that had consistently been the better students. 

The second application involves the data from an insurance company. The user 
knew that the number of claims and the amounts per claim had risen significantly 
over the years. Yet, it was not clear whether there were some specific groups of 
people who were responsible for the higher number of claims or that the claims were 
just random. In order to decide on suitable actions, our user wanted to know 
the claiming patterns of the insured over the years. Using data from the past 
five years, our system again discovered some interesting changes. We found that 
certain groups of the insured had gradually emerged as the major claimants over the 
years. On the other hand, there was another group of the insured that no longer made any 
claims even though, in the beginning, they did put in claims. 

4. Related Work 

Mining and learning in a changing environment has been studied in machine learning 
[e.g., 14, 17], data mining [1, 4] as well as in computational learning theory [5, 7]. In 
machine learning, the focus is on how to produce good classifiers in on-line learning 
of a drifting environment [14, 17]. The basic framework is as follows: The learner 
only trusts the latest data cases. This set of data cases is referred to as the window. 
New data cases are added to the window as they arrive, and the old data cases are 
deleted from it. Both the addition and deletion of data cases trigger modification to 
the current concepts or model to keep it consistent with the examples in the window. 
Clearly, this framework is different from our work as it does not mine changes. 
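
The windowing framework can be illustrated with a toy learner whose "model" is simply the majority class of the cases currently in the window. This is a sketch under strong assumptions: the class name and the majority-class model are stand-ins chosen for brevity, and the systems in [14, 17] maintain far richer concept descriptions.

```python
from collections import Counter, deque


class WindowedMajorityLearner:
    """Toy on-line learner that only trusts the latest `window_size` cases."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()            # the most recent class labels
        self.counts = Counter()          # the current "model"

    def add(self, label):
        """Add a new case; delete the oldest one if the window overflows."""
        self.window.append(label)
        self.counts[label] += 1          # addition updates the model
        if len(self.window) > self.window_size:
            oldest = self.window.popleft()
            self.counts[oldest] -= 1     # deletion updates the model as well

    def predict(self):
        """Return the majority class in the window, or None if it is empty."""
        if not self.window:
            return None
        return self.counts.most_common(1)[0][0]
```

The learner adapts its prediction as the window slides, but it never reports what has changed between the old and the new cases, which is the gap that change mining addresses.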

In computational learning theory, there are also a number of works [5, 7] on 
learning from a changing environment. The focus is on theoretical study of learning a 
function in a gradually changing domain. They are similar in nature to those works in 
machine learning, and they do not mine changes as the proposed method does. 

[6] presented a general framework for measuring changes in two models. 
Essentially, the difference between two models (e.g., two decision trees, one 
generated from data set $D_1$ and one generated from data set $D_2$) is quantified as the 
amount of work required to transform one model into the other. For decision trees, it 
computes the deviation by overlaying the two trees generated from the two data sets. 
We have shown that overlaying one tree on another is not satisfactory 
for change mining. The proposed method is more suitable. 

Another related research area is subjective interestingness in data mining. [10, 11, 16] 
give a number of techniques for finding unexpected rules with respect to the user's 
existing knowledge. Although the old model in our change mining can be seen as "the 
user's existing knowledge" (it is not from the user), interestingness evaluation 
techniques cannot be used for mining changes, as their analysis only compares each 
newly generated rule with each existing rule to find the degree of difference. It does 
not find which aspects have changed and what kinds of changes have occurred. 

5. Conclusion 

In this paper, we study the problem of change mining in the context of decision tree 
classification. This is motivated by two real-life applications. We could not find an 
existing technique to solve the problems. This paper proposed a technique for the 
purpose. Empirical evaluation shows that the method is very effective. 

We believe change mining will become more and more important as more and 
more data mining applications are implemented in production mode. In our future 
work, we plan to address the problem of evaluating the quality of the changes and to 
detect reliable changes as early as possible by on-line monitoring. 

Abstract. Time-series data mining presents many challenges due to 
the intrinsic large scale and high dimensionality of the data sets. 
Subsequence similarity matching has been an active research area driven by 
the need to analyse large data sets in financial, biomedical and scientific 
databases. In this paper, we investigate an intelligent subsequence 
similarity matching of time series queries based on efficient graph 
traversal. We introduce a new problem, the approximate partial matching of a 
query sequence in a time series database. Our system can address such 
queries with high specificity and minimal time and space overhead. The 
performance bottlenecks of the current methods were analysed and we 
show that our method can improve the performance of time series queries 
significantly. It is general and flexible enough to find the best approximate 
match query without specifying a tolerance parameter $\epsilon$. 



1 Introduction 

Subsequence similarity matching in time series data suffers from the curse of 
dimensionality, since a query sequence may be matched to any offset in the 
sequence database. A popular method to address this problem is to use an 
orthogonal transform (such as FFT [1] or wavelets [4]) to map the data points to 
a feature space, and to extract features to index the sequence into a spatial 
index. There have been significant efforts in finding similar sequences using linear 
transformations [9, 12], probabilistic methods [10], and episodic events [11]. 

This paper focuses on approximate similarity matching of a user-defined 
query subsequence in a sequence database. We introduce the notion of partial 
similarity matching of the query, in which we attempt to match significant fea- 
tures of the input query sequence. Obviously, a partial match of a short query 
section is relatively unimportant. In the extreme case, a single point may match 
but this is not significant and is likely of little importance to the user. A 
maximal partial match of the query may be the most significant to the user, since this 
identifies the major features of the query. Our method optimizes the expensive 
post-processing stage of the current search-and-prune methods [7,4,12], so it 
can improve the performance of all these methods. 

The outline of the paper is as follows: we begin with a background of the time 
series similarity problems and the current methods. An analysis of the ST-index 
method [7] leads us to the motivation of our study. We present some interesting 
problems that are not feasible with the current technology, and we propose 
our method of addressing the approximate matching, based on efficient graph 
traversal. We demonstrate the effectiveness of our system with performance 
experiments, followed by a discussion of the results. 

Fig. 1. Random walk sequence and simulated user-defined query (legend: time-series sequence, simulated query) 

2 Background 

An early work in time-series subsequence similarity matching introduced the 
ST-index [7], which combined feature mapping with spatial indexing (further 
details given below). Agrawal et al. [2] defined subsequence similarity as 
non-overlapping piecewise matching segments with minimal gaps. They used a 
window stitching algorithm to ignore small gaps where there may be an outlier. A 
partial match can be considered a long segment match with an infinite length 
gap, but it is not clear how the above methods can be adapted to efficiently 
handle this. 

The distinction between exact and approximate similarity matching was 
formalized for time-series data in [9], and the definition of similarity was extended 
to include shape-preserving transformations like normalization, shifting and 
scaling. Subsequently, other transformation-based similarity queries such as 
inverting, moving average [12], and warping [13] have been presented. Since those 
techniques are typically performed before the spatial index search, and we address 
the post-processing stage, our method is complementary to these transforms and 
may be used in conjunction with these methods. 

In current similarity search methods, there is a heavy reliance on the 
user-specified tolerance $\epsilon$. The query results and performance are intrinsically tied to 
this subjective parameter, which is a real usability issue. We give an approximate 
best match query that eliminates this requirement from the user. This is similar 
to the nearest neighbour query mentioned in [4], but their algorithm considers 
only single-piece queries. Handling long multi-piece queries faces the same pruning 
inefficiencies as the ST-index. 

AIM: Approximate Intelligent Matching for Time Series Data 

Table 1. Summary of symbols and their definitions. 

Symbols          Definitions 
$S$              A time-series sequence. 
$Q$              A query sequence. 
$w$              FFT window size; same as $|FFT|$. 
$N$              Number of subqueries (or query segments) in $Q$. 
$w$              The subquery and subtrail window size. 
$S[i]$           A point $i$ in the sequence $S$. 
$Q[i]$           A point $i$ in the query $Q$. 
$D(S, Q)$        Euclidean distance between $S$ and $Q$ of equal lengths. 
$s_i$            The $i$-th subsequence in a time-series sequence. 
$q_i$            The $i$-th subquery (or query segment) in a query sequence. 
$s_i[j]$         The $j$-th point in subsequence $s_i$. 
$q_i[j]$         The $j$-th point in subquery $q_i$. 
$C^i$            A candidate subsequence containing candidate segments which 
                 match a particular subquery. 
$s_j^*$          A candidate segment in $C^i$ for subquery $q_i$. 
$n$              Number of query segments matching a subsequence in partial matching. 
$\epsilon$       Error tolerance for similarity matching. 
$\epsilon_u$     Unit error tolerance; the maximum error allowed 
                 when matching $s_i$ against $q_i$. 
$\epsilon_n$     Proportional error tolerance; the maximum error 
                 allowed in a partial match. 



Analysis of ST-index We chose to analyse the ST-index since its search-and-prune 
style is the basis for many of the current methods in similarity search [9, 
4, 12, 13]. A basic sketch of the ST-index is provided here but we encourage the 
reader to review the details in [7]. First, the input sequence $S$ is mapped from 
the time domain to the frequency domain using the fast Fourier transform (FFT). By 
taking the first few FFT coefficients (2-3) of a sliding window $w$ over the data 
sequence, the time series is transformed into a feature trail in multidimensional 
space (Figure 2). The trails are divided into sub-trails which are subsequently 
represented by a Minimum Bounding Rectangle (MBR) and stored in a spatial 
index structure, the R*tree [3]. 
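
The feature-mapping step can be illustrated with a small sketch. It is not the ST-index implementation from [7]: the function names are hypothetical, a plain DFT stands in for an FFT library, and sub-trails are cut at a fixed length purely for simplicity, which is an assumption of this example. The R*tree itself is omitted; only the trail and its bounding rectangles are computed.

```python
import cmath

def dft_features(window, n_coeffs=2):
    """Real and imaginary parts of the first n_coeffs DFT coefficients."""
    w = len(window)
    feats = []
    for k in range(n_coeffs):
        # Orthonormal scaling (1/sqrt(w)) keeps feature-space distances from
        # exceeding the time-domain Euclidean distance.
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / w)
                    for t, x in enumerate(window)) / w ** 0.5
        feats.extend((coeff.real, coeff.imag))
    return tuple(feats)

def feature_trail(seq, w, n_coeffs=2):
    """Slide a window of size w over seq; one feature point per offset."""
    return [dft_features(seq[i:i + w], n_coeffs)
            for i in range(len(seq) - w + 1)]

def sub_trail_mbrs(trail, chunk):
    """Cut the trail into fixed-length sub-trails (an assumption of this
    sketch) and return (start_offset, low_corner, high_corner) for each."""
    boxes = []
    for start in range(0, len(trail), chunk):
        points = trail[start:start + chunk]
        dims = range(len(points[0]))
        low = tuple(min(p[d] for p in points) for d in dims)
        high = tuple(max(p[d] for p in points) for d in dims)
        boxes.append((start, low, high))
    return boxes
```

Each bounding box keeps the starting offset of its sub-trail, which is the information a spatial index would hand back for a point query.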

A sequence query $Q$ is decomposed into query segments $q_1, q_2, \ldots, q_N$ of 
window size $w$. Each query segment $q_i$ is mapped to a point in feature space, and 
used as a multidimensional point query into the R*tree (Figure 3). A point 
subquery search performed on the R*tree returns all the intersecting MBRs. This 
results in a set of possible candidates for each subquery. 

For instance, in Figure 4, a query $Q$ with $N$ subquery segments of window 
size $w = 100$ gives a list of sub-trails in the sequence database identified by the 
starting point offset in the sequence; that is, the actual matching segment begins 
somewhere in that sub-trail. 

In the ST-index, the Euclidean distance $D(S, Q) < \epsilon$ is used to compare the 
subsequences against the query to discard the false alarms. 

$$D^2(S, Q) = \sum_{i=1}^{n} (S[i] - Q[i])^2 < \epsilon^2 \qquad (1)$$

For a query with error tolerance $\epsilon$, a matching subsequence must have at 
least one subtrail $s_i$ within $\epsilon/\sqrt{N}$ radius of the multidimensional subquery 
point $q_i$; the Lemma is given in [7]. 

$$D^2(s_i, q_i) = \sum_{j=i \cdot w+1}^{(i+1) \cdot w} (s_i[j] - q_i[j])^2 < \epsilon^2 / N \qquad (2)$$

We call this the unit error tolerance $\epsilon_u$. The subtrail found to be within 
$\epsilon_u$ of the subquery is a candidate for a full query match. Depending on how 
tightly the $\epsilon$ bound is specified, the subquery check reduces the candidate set and 
then a post-processing check of the full query against the raw time series data is 
performed, to finally remove all the false positives and find the true subsequences 
which match the query. 
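
The relationship between the full-query test and the per-segment test can be checked on small data. The sketch below is illustrative only: the helper names are hypothetical, segments are taken as adjacent windows of size `w`, and the comparison is done in the time domain rather than on FFT features.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def segments(seq, w):
    """Adjacent, non-overlapping windows of size w."""
    return [seq[i:i + w] for i in range(0, len(seq) - w + 1, w)]

def unit_tolerance(eps, n_segments):
    """Unit error tolerance: eps spread uniformly over the N query segments."""
    return eps / math.sqrt(n_segments)

def full_match(subseq, query, eps):
    """Post-processing test on the raw data, as in equation (1)."""
    return sq_dist(subseq, query) < eps ** 2

def passes_unit_check(subseq, query, w, eps):
    """True if at least one aligned segment pair is within the unit error,
    which is the necessary condition stated by the lemma."""
    q_parts = segments(query, w)
    eps_u = unit_tolerance(eps, len(q_parts))
    return any(sq_dist(s, q) < eps_u ** 2
               for s, q in zip(segments(subseq, w), q_parts))
```

Every subsequence accepted by `full_match` also passes `passes_unit_check`, but not the other way round, which is why the candidates still need the post-processing step.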

3 Motivation 

The ST-index performance can be broadly broken down into two steps: the 
spatial index (R*tree) search, and the post-processing stage. Spatial and high- 
dimensional indexing is a very active research area [8] so any advancements 
there can be applied to the spatial indexing stage. Yet, there has not been 
much attention given to the post-processing phase of the similarity search. We 
believe the current methods are not particularly optimized for longer approx- 
imate query matching. In fact, we hypothesize that under realistic conditions 
the post-processing time may be the bottleneck in the ST-index. Specifically, 
the post-processing step where each candidate from the R*tree search must be 
checked as a potential match can be a very time-consuming process. 

We ran performance studies to identify the spatial index and post-processing 
times required for the similarity matching. For a random walk sequence $S$ of 
length 20000 points with mean = 0 and variance = 1, the tolerance was set to 
$\epsilon = \sqrt{N}$ to allow sufficient candidate matches. Notice, in Figure 5, that as the query 
$Q$ gets longer, the post-processing phase contributes the majority of 
the overall ST-index time. Further, the post-processing time after the R*tree 
search becomes proportionally higher. This can be explained by the fact that 
a longer query has more segment pieces $q_i$, and since each subquery search in 
the R*tree returns a set of candidate segments, there are more candidates on 
which false positive checks must be made. Hence, for a long query, the 
post-processing time to remove the false positives dominates the total time for the 
ST-index. Similarly, queries with larger tolerance $\epsilon$ generated more candidates 
for post-processing and scaled poorly. 

We were motivated to study approximate similarity matching of long queries 
by the inflexibility of the current query methods. A long time-series query with 
multiple features (like double peaks shown in Figure 1) may not have a match 
within a given tolerance in the time-series database. The user may have specified 
the error tolerance too restrictively, which does not allow a close but inexact 
match to be uncovered. 

Partial Query Match Given a user-supplied query $Q$ and error $\epsilon > 0$, we seek 
a partial approximate match in the time-series database. Although there may 
not be a match for the full query $Q$ in the sequence database, we would like 
to identify the subsequences which match a significant portion of the query. 

Suppose a subsequence has a significant portion of the query matching but the 
non-matching segment causes the distance metric to surpass the allowable 
tolerance. It may be preferable to return to the user some answers that are partial 
matches of the input query rather than not returning any response at all. This 
is analogous to intelligent query answering in relational databases, and we feel 
it has relevance with time-series data. 

As shown above, the query performance and results can be substantially 
affected by the user’s choice of $\epsilon$. Even though the user may not be aware of how 
the query length is related to the error tolerance, s/he is required to input such 
a subjective parameter. We feel this is putting too much of a burden on the user 
who wants to perform a subsequence query, but has no intimate knowledge of 
the data scale or distance metric. 

Best Approximate Match Given an input query $Q$ and without an error 
tolerance $\epsilon$, the Best Approximate Matches are the subsequences in the 
sequence database $S$ which have the $N$ lowest Euclidean errors from the input 
query $Q$. 

This would be a more realistic scenario, where the user just enters a query and 
requests the top N matches in the sequence database. Now, all the candidates 
returned from the R*tree search are potentially a “best match”. 
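
As a reference point for this definition, the best approximate matches can be computed by exhaustive scan. The sketch below is a naive baseline, not the method proposed in this paper; the name `best_matches` and the parameter `top_n` (the number of matches requested) are assumptions of the example.

```python
import heapq

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_matches(sequence, query, top_n):
    """Return (offset, error) for the top_n offsets with the lowest Euclidean
    error, scanning every offset of the sequence (no index, no tolerance)."""
    m = len(query)
    scored = ((sq_dist(sequence[k:k + m], query), k)
              for k in range(len(sequence) - m + 1))
    return [(k, err ** 0.5) for err, k in heapq.nsmallest(top_n, scored)]
```

No tolerance is involved: the ranking alone decides what is returned, which is the behaviour the definition asks for. The cost is one distance computation per offset, which is what an index-based method tries to avoid.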

4 Proposed Method 

We observed that the set of candidates returned from the spatial index 
(R*tree) search forms a directed acyclic graph (DAG). Thus, if we consider the 
subtrails as nodes in a graph, a matching subsequence corresponds to the longest 
path in the DAG. Hence, we can apply an efficient graph traversal method to 
the candidate sets to determine the partial or best match. Now, the similarity 
match can be reformulated as a longest path in a DAG problem with an error 
cost function. Depth-first search (DFS) is known to be an optimal solution to 
the longest path problem in a DAG [5]. 
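
The graph formulation can be made concrete with a generic longest-path routine. This is a textbook DFS with memoization, given here as a sketch rather than as the system described in this paper; the graph encoding (a dictionary from a sub-trail start offset to the offsets that can follow it) is an assumption of the example, and the error cost function is left out.

```python
def longest_path(graph):
    """Longest path in a DAG given as {node: [successor, ...]}.

    The memo table plays the role of node coloring: once the best path
    from a node is known, that node is never expanded again.
    """
    best = {}

    def visit(node):
        if node in best:
            return best[node]
        path = [node]
        for nxt in graph.get(node, ()):
            candidate = [node] + visit(nxt)
            if len(candidate) > len(path):
                path = candidate
        best[node] = path
        return path

    longest = []
    for node in graph:
        path = visit(node)
        if len(path) > len(longest):
            longest = path
    return longest
```

With window size 100, a chain of candidates at offsets 0, 100 and 200 gives a three-segment match, while an isolated pair at 500 and 600 gives only two.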

We also exploit the DFS node-coloring technique whereby if a node has been 
visited, then the node is colored so there is no need to traverse down that node 
again. This quickly reduces the branching factor of the candidate sets, which 
leaves very few candidates for the expensive distance measurement against the 
query sequence. 

There are two kinds of windows in this algorithm: the FFT window, which 
is a sliding window that generates a trail in feature space, and the subtrail 
windows (represented by $s_j^*$ in the algorithm below), which are adjacent, sequential 
windows. Without loss of generality, let the FFT window size equal the subtrail 
window size, $|FFT| = |s_j^*|$. Our algorithm will also work if the subtrail window 
size is some multiple of the FFT window size. There will be a higher branching 
factor from each node, but in general our algorithm is still applicable. For 
demonstrative purposes, we will assume $|FFT| = |s_j^*|$ throughout the rest of the 
paper. The algorithm is given next and we explain why our method works. 



DFS Algorithm 

1. A query $Q$ is decomposed into query segments $q_1, q_2, \ldots, q_N$, and each query 
segment $q_i$ is used as a multidimensional point query into the R*tree. 

2. Then each subquery $q_i$ returns a set of candidate subsequences $C^i$ from the 
R*tree. The superscript refers to the subquery $q_i$ that generated this candidate set. 
The subtrail segments form nodes in the candidate graph shown in Figure 4. 

3. For each candidate segment $s_j^*$, we find the lowest point of error compared 
to the subquery piece $q_i$. Since the subquery piece and the segment window 
are small, a distance check to find the lowest error in that segment is relatively 
fast. We check that the subtrail is within the unit error $\epsilon_u$. 

4. To find the longest matching subsequence, we use a greedy approach to 
expand the matching segment. In other words, we expand the subsequence 
in the direction with the lowest segment error. We keep track of the total error to 
see that the accumulated error for the subsequence is within the proportional 
error $\epsilon_n$ (see below). The matching subsequence is expanded out in the 
direction of lowest error until the sum of all $\epsilon_u$ is greater than $\epsilon_n$ or the query 
size $|Q|$ is reached. We store the matching offset information in an array 
sorted by the distance error. 

5. For another candidate segment, if the starting point of this segment is 
already included in a previous subsequence match, then there is no need to 
match this subtrail. We can skip the longest path search for this segment, 
which has the same effect as skipping a visited node in a DFS graph traversal. 
This results in very fast pruning of candidate nodes in the longest path 
search. 

Typically a graph search is considered an exhaustive operation, but in our 
scenario, DFS is very efficient and much faster than going back to the time-series 
and performing a post-processing sequential match on the raw data. Although 
there are many nodes (i.e., candidate segments) in our graph, there is, in fact, a 
low branching factor from any one node to the next. 
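
The greedy expansion and the skipping of already covered segments can be sketched as follows. This is one possible reading of the algorithm, under simplifying assumptions that are not stated in the text: a single fixed alignment of the query against the sequence, precomputed squared errors per query segment, seeds that have already passed the unit error check, and a tolerance applied to the accumulated squared error. All names are hypothetical.

```python
def greedy_expand(seg_sq_err, seed, eps):
    """Grow a run of consecutive query segments around `seed`.

    seg_sq_err[i] is the squared error of query segment i against the
    aligned window of the sequence. Returns (first, last, total_sq_err),
    with first and last inclusive.
    """
    n_total = len(seg_sq_err)
    lo = hi = seed
    total = seg_sq_err[seed]
    while hi - lo + 1 < n_total:
        left = seg_sq_err[lo - 1] if lo > 0 else float("inf")
        right = seg_sq_err[hi + 1] if hi < n_total - 1 else float("inf")
        step = min(left, right)
        n = hi - lo + 2  # number of segments if this step is accepted
        # Proportional tolerance for n segments, in squared form.
        if total + step > eps ** 2 * n / n_total:
            break
        total += step
        if left <= right:
            lo -= 1
        else:
            hi += 1
    return lo, hi, total

def match_all(seg_sq_err, seeds, eps):
    """Expand each seed unless an earlier match already covers it, which has
    the same effect as skipping a colored node in a DFS traversal."""
    covered, results = set(), []
    for seed in seeds:
        if seed in covered:
            continue
        lo, hi, total = greedy_expand(seg_sq_err, seed, eps)
        covered.update(range(lo, hi + 1))
        results.append((lo, hi, total))
    return sorted(results, key=lambda r: r[2])  # lowest error first
```

In this sketch the second and third seeds of a run cost nothing once the first seed has been expanded over them, which is where the pruning comes from.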



Proportional Error Tolerance The above algorithm uses the unit error $\epsilon_u$ 
and the proportional error $\epsilon_n$, which we define here. If we uniformly distribute 
the allowable error $\epsilon$ for the query, the unit error for a query divided into $N$ 
segments is $\epsilon_u = \epsilon/\sqrt{N}$. By extension, the proportional error for $n$ subquery 
segments is 

$$\epsilon_n = \left( \frac{n}{N} \times \sum_{i=1}^{|Q|} (S[i+K] - Q[i])^2 \right)^{1/2} = \epsilon \times \sqrt{n/N}$$

where $K$ is the offset into the matching sequence. Now we can match a portion 
of the query length within a proportional error tolerance. To clarify, the 
proportional tolerance is not a static measure but is scaled for the number of subquery 
segments under consideration. 

A partial match then is defined as a subsequence that matches a subsection 
of the query within a proportional error tolerance $\epsilon_n$. The proportional error 
tolerance is the allowable error for a partial match. For example, if the longest 
partial match is half the length of the query, the allowable error is $\epsilon_n = \epsilon/\sqrt{2}$. 

For the partial match case where $\epsilon$ is given, the DFS is started at a segment 
where a match is within $\epsilon_u$. With the ST-index we have all the segment errors in 
a vector, so it is possible to keep track of whether the cumulative error is greater 
than the proportional error $\epsilon_n$. For the best match query, start at the segment 
with the lowest unit error, and take the most consecutive candidate segments 
and test for errors. 
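
The scaling of the tolerance with the number of matched segments can be checked numerically. The two helper names below are hypothetical; the formulas are the ones stated in this section.

```python
import math

def unit_tolerance(eps, n_total):
    """Unit error: eps spread uniformly over N query segments."""
    return eps / math.sqrt(n_total)

def proportional_tolerance(eps, n, n_total):
    """Allowable error for a partial match covering n of N segments."""
    return eps * math.sqrt(n / n_total)

# A query of N = 8 segments with eps = 4.0:
#   one segment   -> equals the unit error
#   four segments -> eps / sqrt(2), the half-length case
#   all eight     -> the full tolerance eps
for n in (1, 4, 8):
    print(n, proportional_tolerance(4.0, n, 8))
```

The one-segment and full-length cases recover the unit error and the original tolerance, so the proportional tolerance interpolates between the two.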



5 Experiments 

We report two experiments to demonstrate the effectiveness of our method and 
compare with existing methods: a partial match query and a best approximate 
match query. Our method is compared against the ST-index method for each of 
these experiments. We do not show the naive sequential search times since they 
have much longer run times [7] than any of these methods and may distort the 
comparative graphs. Since our DFS method and the ST-index both implement 
a spatial index (R*tree) search initially, we show the R*tree search time to 
illustrate the post-processing cost. For all our experiments, we took the average 
of 10 runs to smooth the random variations. Our system and the ST-index prototype 
were written in Java 1.2.2 on Windows NT 4.0, running on an Intel Pentium 
166 MHz processor with 96 MB of RAM. 



Partial Match Experiment For the partial match, we keep extending the 
longest path in the graph while the proportional error $\epsilon_n$ is within the allowable 
error tolerance. Even though the longest path may not be the full length of the 
query, we can find the longest partial match which the user may be interested 
to discover. In the partial match experiment, our hypothesis is that we expect 
to find all the correct matches (100% specificity) with no major overhead. The 
experiment is designed as follows: 

— Two random subsequences are chosen from the sequence database, and we 
call these query pieces $Q_1$ and $Q_2$. 

— They are smoothed using a 16-point moving average, since we wish to 
simulate a smooth user-defined shape. This will also introduce some error such 
that it will result in more than one possible match. 

— For each query, calculate the error from the correct positions, and make the 
error tolerance $\epsilon$ larger to allow other possible matches. 

— Then concatenate the two query subsequences $Q_1$ and $Q_2$ into one query $Q$. 

— Partial match query is run to find both portions of the query. That is, we 
expect to find both $Q_1$ and $Q_2$ as the top two matches using the DFS 
algorithm. 

Fig. 6. Scalability of Partial Match vs. $|Q|$, $|S| = 20000$, $|FFT| = 32$ 

Fig. 7. Scalability of Best Approximate Match vs. $|Q|$, $|S| = 20000$, $|FFT| = 32$ 
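
The construction of the composite query can be sketched in a few lines. The sketch makes assumptions that the text does not fix: the random walk is built from Gaussian steps with mean 0 and variance 1, the moving average is a trailing one that keeps the input length, and the function names and seeds are hypothetical.

```python
import random

def random_walk(length, seed=0):
    """Cumulative sum of Gaussian steps (mean 0, variance 1)."""
    rng = random.Random(seed)
    walk, value = [], 0.0
    for _ in range(length):
        value += rng.gauss(0.0, 1.0)
        walk.append(value)
    return walk

def moving_average(seq, span=16):
    """Trailing moving average; the first span-1 outputs average fewer points."""
    out, running = [], 0.0
    for i, x in enumerate(seq):
        running += x
        if i >= span:
            running -= seq[i - span]
        out.append(running / min(i + 1, span))
    return out

def composite_query(sequence, piece_len, rng):
    """Pick two random pieces, smooth each, and concatenate them.

    Returns the two true offsets together with the query Q.
    """
    k1 = rng.randrange(len(sequence) - piece_len + 1)
    k2 = rng.randrange(len(sequence) - piece_len + 1)
    q1 = moving_average(sequence[k1:k1 + piece_len])
    q2 = moving_average(sequence[k2:k2 + piece_len])
    return (k1, k2), q1 + q2
```

Keeping the true offsets alongside the query makes it possible to check afterwards whether both pieces were recovered.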

We sought a comparison for our partial approximate match query, but we 
were not aware of any existing methods which were capable of handling such 
queries. We decided to compare these results to the times of the ST-index search, 
even though a correct match is not expected. An ST-index query with the 
composite query $Q$ can have variable results. It may find one of the two subqueries 
(i.e., $Q_1$ or $Q_2$), or find another subsequence which does not contain either $Q_1$ or 
$Q_2$ but somehow satisfies the $\epsilon$ tolerance and is considered a match. However, 
these results are neither reliable nor predictable and were not considered significant 
for our purposes. We will just compare the running times for an ST-index query 
against the DFS partial match query. 

Notice in Figure 6 that our method contributes only a small incremental time 
beyond the R*tree search time and outperforms the ST-index time as the query 
size $|Q|$ increases. Our DFS method for post-processing was able to identify the 
subqueries with 100% specificity, and in comparison with the ST-index, there 
is very little overhead above the R*tree search time. Hence, we can find partial 
match queries with minimal overhead and high reliability. 



Best Approximate Match Experiment The ST-index can be modified to 
find the best match by finding the lowest matching error among the candidate 
sets. Since there is no $\epsilon$ defined, the subtrail pruning step would be very 
inefficient, as all candidates would have to be considered. There would be a sorting 
step after the post-processing to find the subsequence with the lowest error. 

— For a best approximate match experiment without $\epsilon$, create a random walk 
query, and smooth it using a 16-point moving average. A random walk was 
chosen to ensure there would be more than one clear best match. 

— Find the top $N$ best matches using the error measurement for DFS and 
ST-index. Compare the result sets from both methods. 

The DFS algorithm used here is essentially the same as in the partial match 
experiment, but there is no $\epsilon$ to stop the expansion of the subsequence pieces. 
Hence, the comparative times will be similar to those in Figure 6. For this 
experiment, we give the run times as a ratio of the ST-index time. Again, since both 
methods use the R*tree search, the ratio $T_{ST}/T_{R^*tree}$ gives the overhead cost 
of the ST-index post-processing. In Figure 7, the DFS method shows an almost 
fixed ratio overhead, while the ratio of the ST-index times grows with the 
query length. It is worthwhile to remember that the spatial index time also grows 
with query length, so when the ratio grows, the run times become prohibitively 
expensive. 



6 Concluding Remarks 

We were motivated to study approximate time-series matching since current 
subsequence matching algorithms, like ST-index, do not consider flexible matching 
scenarios. In this light, we present two realistic problems that have not yet been 
addressed adequately, the partial match and best approximate match problems. 
Our experiments show an effective solution clearly addressing the post-processing 
inefficiencies of the existing methods. 

As mentioned earlier, time-series similarity search has been an active area 
of research. In spite of these many efforts, there are other interesting similarity 
search problems that have not been addressed. In this study, we extended the 
notion of approximate similarity matching defined in [9]. We gave efficient 
solutions to the partial and best approximate match queries. Our method can also be 
used to address the partial best approximate match (without $\epsilon$). In this scenario, 
the longest path with the lowest average unit error is considered the best partial 
match. But there is some issue of how to compare partial matches of different 
lengths. A subsequence with a longer partial match and a higher average unit 
error may be preferable to a shorter segment with a low unit error. The prefer- 
ence would be essentially application-dependent and may be addressed with a 
heuristic function of the partial match length and the average unit error. 

The major contribution of our work is allowing flexible query answering, while 
improving the efficiency of the time-series similarity query processing. Our DFS 
method provides effective solutions for the partial match and best approximate 
match queries. We call these intelligent matching schemes since they unburden 
the user of having to fully specify a time series query. In the future, we seek to 
extend the intelligence from pattern matching to time-series pattern discovery. 
There has been some initial work in the direction of rule discovery [6], yet we 
feel this is a ripe field for further advancement in time-series data mining. 

Abstract. Feature Extraction, also known as Multidimensional Scaling, 
is a basic primitive associated with indexing, clustering, nearest 
neighbor searching and visualization. We consider the problem of fea- 
ture extraction when the data-points are complex and the distance eval- 
uation function is very expensive to evaluate. Examples of expensive 
distance evaluations include those for computing the Hausdorff distance 
between polygons in a spatial database, or the edit distance between 
macromolecules in a DNA or protein database. 

We propose Cofe, a method for sparse feature extraction which is based 
on novel random non-linear projections. We evaluate Cofe on real data 
and find that it performs very well in terms of quality of features ex- 
tracted, number of distances evaluated, number of database scans per- 
formed and total run time. We further propose Cofe-GR, which matches 
Cofe in terms of distance evaluations and run-time, but outperforms it 
in terms of quality of features extracted. 



1 Introduction 

Feature Extraction, also known as Multidimensional Scaling (MDS), is a basic 
primitive associated with indexing, clustering, nearest neighbor searching and 
visualization. The simplest instance of feature extraction arises when the points 
of a data set are defined by a large number, k, of features. We say that such points 
are embedded in k-dimensional space. Picking out $k' \ll k$ features to represent 
the data points, while preserving distances between points, is a feature extraction 
problem called the dimensionality reduction problem. 

The most straightforward and intuitively appealing way to reduce the num- 
ber of dimensions is to pick some subset of size k' of the k initial dimensions. 
However, taking k' linear combinations of the original k dimensions can often 
produce substantially better features than this naive approach. Such approaches 
are at the heart of methods like Singular Value Decomposition (SVD) [10] or the 
Karhunen-Loeve transform [8]. Linear combinations of original dimensions are 
but one way to pick features. Non-linear functions of features have the potential 
to give even better embeddings since they are more general than linear combi- 
nations. 

In many other cases the points are complex objects which are not embedded. 
For example, if the dataset consists of DNA or protein sequences then there is no 
natural notion of a set, large or otherwise, of orthogonal features describing the 
objects. Similarly, in multimedia applications, the data may include polygons as 
part of the description of an object. While such data types are not represented 
in a feature space, they are typically endowed with a distance function, which 
together with the dataset of objects define a distance space. For example, the 
distance between biological macromolecules is taken to be some variant of the 
edit distance between them. For geometric objects in 2- or 3-dimensions, the 
distance is often measured as the Hausdorff distance. These distance functions 
are very expensive to compute. The edit distance between two sequences of 
length $m$ takes $O(m^2)$ time to compute, while the Hausdorff distance between 
two geometric objects, each with $m$ points, takes time $O(m^5)$ to compute. Even 
though complex objects have a finite representation in the computer, the natural 
feature space this representation describes does not preserve distances between 
objects. For example, the $k$ points of the polygon can be trivially represented by 
$O(k)$ dimensions by a straightforward $O(k)$ bit computer representation, but the 
vector distance between such embeddings is unrelated to any natural geometric 
distance between the polygons. 
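
To make the cost of such a distance function concrete, the sketch below computes the standard unit-cost edit (Levenshtein) distance by dynamic programming. It fills an (m+1) by (m+1) table for two sequences of length m, which is where the quadratic time comes from. The unit costs are an assumption of this example; distances used for biological sequences are typically weighted variants.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between sequences a and b.

    Runs in O(len(a) * len(b)) time and keeps only two rows of the table.
    """
    previous = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]
```

Evaluating this for every pair in a database of n long sequences multiplies the quadratic per-pair cost by the number of pairs, which is the expense a sparse method tries to avoid.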

The Complex Object Multidimensional Scaling (COMDS) problem is then the 
problem of extracting features from objects given an expensive distance function 
between them. A good solution to the COMDS problem has to have good quality: 
Quality: The k features extracted should reflect, as closely as possible, the 
underlying distances between the objects. Furthermore, the extracted features 
should be good even with small k. If we are interested in visualization, then k = 2 or 3. 
Clustering and nearest neighbor searching become prohibitively expensive, or 
the quality of the clustering degrades, if k is more than about 10. Thus the 
quality of a COMDS algorithm depends on the quality of a small number of 
extracted features. There is a tradeoff between the quality and the scalability of 
a solution to the COMDS problem. 

A scalable solution should have the following characteristics: 

Sparsity: Since the distance function is very expensive to evaluate, it is not 
feasible to compute all $\binom{n}{2}$ pairwise distances, where $n$ is the number of elements 
in the database. Thus, the method must compute only a sparse subset of all 
possible distances. 

Locality: As many databases continue to grow faster than memory capacity, the 
performance of any COMDS solution will ultimately depend on the number of 
accesses to secondary storage. Therefore, a COMDS solution should have good 
locality of object references. This can be measured by the number of database 
scans necessary to compute the embedding. Thus, this number should be low. 

We address these issues in designing the Complex Object Feature Extraction 
(Cofe) method, an algorithm for the COMDS problem. We evaluate Cofe and 
compare it with FastMap [7], a previously proposed solution for COMDS. 

Features define a metric space. The standard way to define the distance 
between two $k$-dimensional feature vectors is through their $l_2$ (Euclidean) 
distance, that is, if point $p$ has features $p_1, \ldots, p_k$ and point $q$ has features $q_1, \ldots, q_k$, 
we can interpret the “feature distance” as 

$$d'(p, q) = \sqrt{\sum_{i=1}^{k} (p_i - q_i)^2}.$$

Taken in this view, we seek the embedding of the real distance function $d(\cdot, \cdot)$ into $\mathbb{R}^k$ such 
that the induced $l_2$ distance function $d'(\cdot, \cdot)$ is as similar to $d(\cdot, \cdot)$ as possible. We 
chose to use the Euclidean distance between feature vectors because a consider- 
able amount of database work has focused on indexing and related tasks for low 
dimensional Euclidean data. No other metric combines the robustness of a low 
dimensional Euclidean space with its tractability. A survey of techniques which 
can be applied to low-dimensional Euclidean data can be found in [7]. 

Measuring the Quality of an Embedding. We compared embeddings based 
on the Stress criterion proposed in [13]. The Stress of a distance function $d(\cdot, \cdot)$ 
with respect to another distance function $d'(\cdot, \cdot)$ is defined as: 

$Stress(d, d') =$ 
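
The sketch below illustrates how a criterion of this kind can be computed. It assumes the normalized form commonly used in the MDS literature: the square root of the summed squared differences between embedded and real distances, divided by the summed squared real distances. That specific form, and the names `feature_distance` and `stress`, are assumptions made for illustration rather than definitions taken from this text.

```python
import math
from itertools import combinations

def feature_distance(p, q):
    """Euclidean (l2) distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def stress(objects, features, d):
    """Assumed normalized stress of an embedding.

    objects[i] is embedded as features[i]; d is the real distance function.
    Zero means every pairwise distance is preserved exactly.
    """
    num = den = 0.0
    for i, j in combinations(range(len(objects)), 2):
        real = d(objects[i], objects[j])
        embedded = feature_distance(features[i], features[j])
        num += (embedded - real) ** 2
        den += real ** 2
    return math.sqrt(num / den)
```

Note that evaluating this quantity exactly requires every pairwise real distance, so for an expensive distance function it would in practice be estimated on a sample of pairs.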






2 Related Work 

Multidimensional Scaling (MDS) has been used in the literature with different 
meanings, though in this paper we restrict ourselves to the standard meaning 
within the database community [14], as described above. MDS can be applied 
as a preprocessing step for data to be used in indexing, clustering and related 
tasks. The method we propose, called Cofe, is a scalable MDS method. 

Traditional MDS methods [15] do not offer a scalable solution to the COMDS 
problem, because of high complexity and unrealistic resource assumptions. In [7], 
Faloutsos and Lin proposed the FastMap method for MDS. Their innovation was 
to consider sparsity as an important factor for MDS and FastMap is the first 
method we know of which maintains quality and sparsity simultaneously. However,
Faloutsos and Lin did not consider the problem of scalability (the largest
instance which they used to evaluate their method had 154 points and consisted
of 13-dimensional categorical data), and FastMap is not optimized for complex
objects. Faloutsos and Lin [7] discuss traditional MDS methods [15] and their
drawbacks, and compare them with FastMap. We describe FastMap in detail below,
and make extensive comparisons between FastMap and Cofe. 

Once an MDS method has embedded the database in low-dimensional Eu- 
clidean space, efficient methods for indexing or clustering, to support nearest 
neighbor search or visualization, can be applied. A considerable amount of 
database work has focused on indexing, clustering and related tasks for low 
dimensional Euclidean data. Here, we mention briefly some of the key results in 
this area, in particular for the problem of similarity querying, which has been 
used to mean nearest neighbor searching in a database. 

Multi-dimensional index structures called Spatial Access Methods (SAMs)
were designed to rely on certain clustering properties of objects in low-dimensional 
space. If these assumptions hold we would expect objects to occupy regions of the 
multi-dimensional space that can be captured by the multi-dimensional index. 
The result of previous research [3, 4, 20] indicated that the indexing techniques, 
which are designed for low-dimensional data, do not perform equally well on
high-dimensional data. Some of the most popular database indexing structures
used for similarity querying are the R*-tree [5], X-tree [4], SS-tree [20],
SR-tree [12], TV-tree [16], the Pyramid Technique [2] and PK-tree [21]. BUBBLE and
BUBBLE-FM [9] are two recently proposed algorithms for clustering datasets in 
arbitrary distance spaces, based on BIRCH [22], a scalable clustering algorithm. 
BUBBLE-FM uses FastMap to improve scalability. 

3 Embedding Methods 

FastMap. The approach taken in FastMap [7] for embedding points into
$k$-dimensional Euclidean space is based on 3-point linear projections. Consider
three objects $O_a$, $O_b$ and $O_i$, and the distances $d(a, b)$, $d(a, i)$ and
$d(b, i)$ between them. Any such distances which satisfy the triangle inequality
can be represented exactly in two-dimensional Euclidean space (Figure 1). $O_a$
can be assumed to be at position $(0, 0)$, and $O_b$ at $(0, d(a, b))$. To find
$O_i$’s two coordinates $(x_i, y_i)$, we can solve two equations with two
unknowns, plugging in the values $d(a, i)$ and $d(b, i)$.




Fig. 1. Projecting point Oi on the line OaOb 

Therefore, each feature is a linear projection defined by a reference set $O_a$
and $O_b$. The question is then how to pick such a reference pair. Faloutsos and
Lin’s aim was to select a pair along the principal component, à la SVD. Since
it would be computationally expensive to find the principal component of the
points, they suggest finding the pair which is furthest apart, which should lie
more or less along the principal component. But finding such a pair is also
computationally expensive, so the following heuristic is used: Start by
picking a point $O_a$ at random. Find the point furthest from $O_a$ by computing
its distance to all others and consider it $O_b$. Compute $O_b$’s distance to all
others and replace $O_a$ with the most distant point. Repeat the last two steps
$t$ times, where $t$ is a parameter of the algorithm. The last pair of points
$O_a$ and $O_b$ found this way are taken to be the reference points. Note that
this sketch only describes how the first feature is selected. All distance
calculations for the second feature factor out the part of the distance function
which has already been captured by the first feature. Otherwise, if $O_a$ and
$O_b$ are the furthest points originally, they will remain so and all features
will be identical. The selection of a reference set always requires $t$
one-against-all sets of distance evaluations. Thus, selecting the reference
points requires $O(tn)$ distance evaluations. Computing the feature once the
reference set is known requires another $2n$ distance evaluations. Therefore,
$k$ dimensions require $O(tkn)$ distance evaluations.
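
The pivot heuristic and the first projection can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the FastMap code used in the experiments: the names `choose_pivots` and `first_feature` are hypothetical, `dist(i, j)` is an assumed callback returning the distance between objects `i` and `j`, one "pass" is taken to be a single one-against-all scan, and the closed-form coordinate is the cosine-law expression from the FastMap paper [7] (the text above only says that two equations are solved).

```python
import random

def choose_pivots(n, dist, t=2, seed=0):
    """Return indices (a, b) of the two reference objects after t passes."""
    rng = random.Random(seed)
    a = rng.randrange(n)   # O_a starts as a random object
    b = a
    for _ in range(t):
        # One one-against-all pass: find the object furthest from a.
        far = max(range(n), key=lambda i: dist(a, i))
        # The far object becomes the new anchor; the old anchor is its partner.
        a, b = far, a
    return a, b

def first_feature(n, dist, a, b):
    """Project every object on the line through O_a and O_b (2n evaluations).

    Assumes d(a, b) > 0, i.e. the two reference objects are distinct.
    """
    d_ab = dist(a, b)
    return [(dist(a, i) ** 2 + d_ab ** 2 - dist(b, i) ** 2) / (2 * d_ab)
            for i in range(n)]

# Toy usage: five objects on a line, distance = absolute difference.
xs = [0.0, 1.0, 2.0, 3.0, 10.0]
d = lambda i, j: abs(xs[i] - xs[j])
a, b = choose_pivots(len(xs), d, t=2)
coords = first_feature(len(xs), d, a, b)
```

On this toy input the two extreme objects end up as the reference pair, and the single extracted feature reproduces all pairwise distances, since the data is one-dimensional to begin with.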

FastMap-GR. Once a method has extracted $k$ features, the dimensionality of
the embedding space can be further reduced by selecting a smaller set of $k'$ high
quality features from the initial $k$. The quality of a feature can be quantified in
terms of how well it reduces the Stress [13]. We propose the Greedy Resampling
algorithm for picking out the best subset of features, which has the advantage
of sparsity over traditional methods like SVD [10], the Karhunen-Loeve
transform [8] and even the recent and more efficient SVD-based method [11].

FastMap-GR starts by picking out $k$ features using FastMap. Then, each
single feature is compared in terms of what Stress it produces. Notice that
computing the Stress directly seems to require $\binom{n}{2}$ distance evaluations,
where $n$ is the size of the dataset, but we can simply pick out a random sample of
distances and compare the Stress on just these distances. Thus, picking out the best
feature is very fast compared to finding the $k$ features to begin with. Once some
feature $f_1$ is selected as the best, we can pick the second feature as the one which
produces the best Stress when combined with $f_1$. We can proceed in this greedy
manner until we have reordered all $k$ features by decreasing order of quality.

Having done this, if we need $k' < k$ features, we can simply take the first $k'$
features of the reordered embedding. Thus, Greedy Resampling is a dimension- 
ality reduction method. Note that it does not guarantee to pick out the best k' 
features, but picking out the best k' features takes exponential time, and so we 
suggest greedy resampling as a heuristic. 
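
The greedy reordering can be sketched as below. This is an illustrative Python sketch only: `greedy_resample` and `stress` are hypothetical names, `emb` is assumed to be a list of $k$-dimensional feature vectors already produced by some embedding method, `dist(i, j)` is an assumed callback for the true distance, and the Stress is assumed to be the normalized root of squared errors used in the FastMap paper [7], evaluated on a random sample of pairs.

```python
import math
import random

def stress(true_d, emb, pairs, feats):
    """Stress of the embedding restricted to `feats`, on the sampled pairs."""
    num = den = 0.0
    for (i, j), d in zip(pairs, true_d):
        e = math.sqrt(sum((emb[i][f] - emb[j][f]) ** 2 for f in feats))
        num += (e - d) ** 2
        den += d ** 2
    return math.sqrt(num / den)

def greedy_resample(emb, dist, n_pairs=4000, seed=0):
    """Reorder feature indices so that each prefix greedily minimizes Stress."""
    rng = random.Random(seed)
    n, k = len(emb), len(emb[0])
    pairs = [tuple(rng.sample(range(n), 2)) for _ in range(n_pairs)]
    true_d = [dist(i, j) for i, j in pairs]  # the only true distance evaluations
    order, remaining = [], set(range(k))
    while remaining:
        # Pick the feature that gives the lowest Stress together with `order`.
        best = min(remaining,
                   key=lambda f: stress(true_d, emb, pairs, order + [f]))
        order.append(best)
        remaining.remove(best)
    return order

# Toy usage: feature 1 is exact, feature 2 is constant, feature 0 is noise.
xs = list(range(10))
emb = [[(7 * x) % 5, x, 0] for x in xs]
order = greedy_resample(emb, lambda i, j: abs(xs[i] - xs[j]), n_pairs=200)
```

Taking the first $k'$ entries of `order` then gives the reduced embedding; the true distance function is called only on the sampled pairs, never on all $\binom{n}{2}$ of them.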

Bourgain. Bourgain’s method [6] is not a sparse method, that is, it evaluates all
$\binom{n}{2}$ possible distances between the $n$ objects of the database. We will therefore
not compare its performance with those of the other methods presented in this 
section. However, Cofe is based on the Bourgain embedding, so we present this 
method for the sake of exposition. 

Suppose we have a set $X$ of $n$ elements. Let the distance function between
elements of $X$ be $d : X \times X \to \mathbb{R}^+$. We define the distance function $D$
between an element of $X$ and a subset $X' \subset X$ as
$D(x, X') = \min_{y \in X'} \{d(x, y)\}$, that is, $D(x, X')$ is the distance from
$x$ to its closest neighbor in $X'$. Let $R = \{X_1, X_2, \ldots, X_k\}$ be a set of
subsets of $X$. Then we can define an embedding with respect to $R$ as follows:
$E_R(x) = [D(x, X_1), D(x, X_2), \ldots, D(x, X_k)]$. It is
not obvious at first why such an embedding might have any reasonable properties.
For one thing it is highly non-linear, and so visual intuition tends not to
be too helpful in understanding the behavior of such an embedding. However,
we can gain some understanding of this type of embedding by considering the
case where the points of the metric space are packed tightly into well separated
clusters. Then, if some set $X_1$ has a point which is in one cluster $C_1$ and not
in another $C_2$, the distance $D(x, X_1)$ will be small for $x \in C_1$ and large for
$x \in C_2$. The dimension corresponding to $X_1$ will then induce a large distance
between points in $C_1$ and $C_2$. Obviously we cannot count on the input points
being packed into well-separated clusters, or on the sets $X_i$ having the right
arrangements with respect to the clusters. However, Bourgain showed that there
is a set of reference sets which works for general inputs.
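
The definitions of $D(x, X')$ and $E_R(x)$ translate almost directly into code. The following is a minimal Python sketch; the function names and the toy one-dimensional data are assumptions made for illustration, and the example reproduces the two-cluster intuition described above.

```python
def set_distance(x, ref_set, dist):
    """D(x, X'): distance from x to its closest neighbor in ref_set."""
    return min(dist(x, y) for y in ref_set)

def embed(x, ref_sets, dist):
    """E_R(x): one coordinate per reference set."""
    return [set_distance(x, s, dist) for s in ref_sets]

# Two tight, well separated clusters on a line.
c1, c2 = [0, 1, 2], [100, 101, 102]
dist = lambda p, q: abs(p - q)
ref_sets = [[1]]   # X_1 has a point in C_1 and none in C_2

e1 = [embed(x, ref_sets, dist) for x in c1]   # small coordinates
e2 = [embed(x, ref_sets, dist) for x in c2]   # large coordinates
```

The single coordinate stays at most 1 inside the first cluster and at least 99 inside the second, so this dimension alone separates the two clusters.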

The Bourgain embedding, then, is a choice of reference sets, $R$, and the
corresponding embedding $E_R$. The choice of the sets in $R$ will be randomized
via a procedure described below: $R$ consists of $O(\log^2 n)$ sets $X_{i,j}$, which we
can think of as naturally organized into columns ($\kappa$) and rows ($\beta$). Let
$R = \{X_{1,1}, X_{1,2}, \ldots, X_{1,\kappa}, \ldots, X_{\beta,1}, \ldots, X_{\beta,\kappa}\}$,
where $\kappa = O(\log n)$ and $\beta = O(\log n)$. The Bourgain embedding will thus
define $O(\log^2 n)$ dimensions. We select the elements of $X_{i,j}$ to be any random
subset of $X$ of size $2^i$. Thus, we get $\kappa$ sets of size 2, $\kappa$ of size 4,
etc., up to sets of size approximately $n$. Here is an
example of the Bourgain embedding. Let $d(\cdot, \cdot)$ be given by the following matrix:

The algorithm picks 6 reference sets at random, 3 with 2 elements and 3 with
4 elements, and computes the embeddings $E_R(x_3)$ and $E_R(x_5)$ with respect
to these sets. The distance between $x_3$ and $x_5$, originally 12, is:
$l_2(E_R(x_3), E_R(x_5)) = \sqrt{3^2 + 8^2 + 0 + 0 + 3^2 + 0} = \sqrt{82} \approx 9.06$.
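
The random choice of reference sets can be sketched as follows. This is a hypothetical Python helper, not code from the paper: the name `choose_reference_sets`, the use of object indices instead of objects, and the default of $\lfloor \log_2 n \rfloor$ rows and columns are assumptions made for illustration.

```python
import math
import random

def choose_reference_sets(n, kappa=None, beta=None, seed=0):
    """Return beta rows of kappa random index sets; row i holds sets of size 2**i."""
    rng = random.Random(seed)
    log_n = max(1, int(math.log2(n)))
    kappa = kappa or log_n   # number of columns
    beta = beta or log_n     # number of rows
    rows = []
    for i in range(1, beta + 1):
        size = min(2 ** i, n)   # a set cannot be larger than the dataset
        rows.append([rng.sample(range(n), size) for _ in range(kappa)])
    return rows

# The setting of the example above: 8 objects, 3 sets of size 2 and 3 of size 4.
rows = choose_reference_sets(8, kappa=3, beta=2)
sizes = [[len(s) for s in row] for row in rows]
```

Each set then contributes one coordinate of the embedding, so the example yields six dimensions.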

An excellent presentation of the proof of the correctness for the Bourgain
embedding can be found in [17]. Bourgain showed [6] that if $R$ is selected as
above, then the embedding $E_R$ has distortion $O(\log n)$, where the distortion
of an embedding is the maximum stretch of any distance, that is,
$d(x_i, x_j) \le \log n \cdot l_2(E_R(x_i), E_R(x_j))$ and
$d(x_i, x_j) \ge l_2(E_R(x_i), E_R(x_j)) / \log n$.

At first glance it seems that modifying a distance by as much as a $\log n$
multiplicative factor would almost completely destroy the original information
contained in the distance function $d(\cdot, \cdot)$. For example, if we are embedding
1024 points, and $d(\cdot, \cdot)$ ranges over a factor of 10, a $\log n$ distortion
embedding could arbitrarily reorder the distances between objects in our dataset.
However, as with many worst case bounds, the Bourgain guarantee of no more than
$\log n$ distortion is often very far from the actual behavior on real data.
Furthermore, the $\log n$ bound is tight in the sense that there are metrics which
require this much distortion for any Euclidean embedding. Experimental analysis on
meaningful data rather than on concocted counterexamples is the sine qua non of
feature extraction evaluation.

The Bourgain embedding has two very serious drawbacks. First, it produces
$O(\log^2 n)$ dimensions. Again, if we are embedding 1024 points, we could end up
with 100 dimensions, which would be far too many. Second, with high probability,
every point in the dataset will be selected in some reference set $X_{i,j}$. Thus, we
will need to compute all $\binom{n}{2}$ distances in order to perform the Bourgain
embedding. If we are using the embedding for similarity searching, we would need to
compute $E_R(q)$ for a query point $q$. This would require a distance evaluation
between $q$ and every point in $X$. There would be no point in performing an embedding
if we were willing to evaluate all $n$ distances between $q$ and the points in $X$, since
we are using the embedding in order to facilitate a nearest neighbor search.

Similarly, the Karhunen-Loeve transform is completely impractical. FastMap 
was, in fact, designed to give a practical alternative to the Karhunen-Loeve 
transform. 





      x1  x2  x3  x4  x5  x6  x7  x8
x1     0  12  10   9  10   6   3  10
x2         0   8  13  14  10  13   8
x3             0  11  12   8  11   2
x4                 0   3   7  10  11
x5                     0   8  11  12
x6                         0   7   8

COFE. COFE is a heuristic modification of the Bourgain method which is
designed to perform very few distance evaluations. The key idea in Cofe is a 
loop interchange. Bourgain’s method is typically thought of as having an outer 
loop which runs through all the points, and then an inner loop which computes 
all the features. We can equivalently imagine computing the first feature for 
every point, then the second feature, etc., until we have computed however many 
features we want. So far, we have not changed anything substantial: whether we 
loop on all points and then on all reference sets, or on all reference sets and 
then on all points, we end up doing the same distance evaluations. Notice that 
the features are defined by a two-dimensional array. Thus, the order in which 
we loop through the features is not determined. For reasons which are suggested 
by the proof of the distortion bound of Bourgain’s algorithm, we will proceed 
row-by-row, that is, we will compute features for all reference sets of size 2, then 
for size 4, etc. 

So finally, we need to know how to actually save on distance evaluations.
The main idea is as follows: Suppose we have already computed $k'$ features for every
point. Then we have a rough estimate of the distances between points, that is,
$d'_{k'}(p, q) = \sqrt{\sum_{i=1}^{k'} (p_i - q_i)^2}$, where $p_i$ is the $i$-th
feature of $p$, and similarly for $q$. If the first $k'$ features are good enough,
then we can very quickly compute the approximate distance from some point to all
points in a reference set, using these first $k'$ features. We can then consider
just the best few candidates, and perform full distance evaluations to these points.
Concretely, for every point $p$ in the dataset $X$ and every reference set $X_k$
(taken in row-major order), the approximate distance $\tilde{D}(p, X_k)$ is computed
as follows:

- For every point $q \in X_k$, compute the approximate distance $d'_{k'}(p, q)$.

- Select the $\sigma$ points in $X_k$ with smallest $d'$ distance to $p$.

- For each point $q$ of these $\sigma$ points, evaluate the true distance $d(p, q)$.

- Then $\tilde{D}(p, X_k) = d(p, q')$, where $q'$ is the $q$ with smallest $d(p, q)$.

$\tilde{D}(p, X_k) = \min_{y \in S} \{ d(p, y) \mid S \subseteq X_k, |S| = \sigma, \forall a \in S, \forall b \in X_k \setminus S, d'(p, a) \le d'(p, b) \}$

Thus, for every point, and for every feature, we compute $\sigma$ distances. We get
$O(nk\sigma)$ distance evaluations, if there are $n$ points and $k$ features.
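
The four steps above can be sketched as a single function. This is a minimal Python sketch under stated assumptions, not the C implementation used in the experiments: `approx_set_distance` is a hypothetical name, `feats[i]` is assumed to hold the features already computed for object `i`, and `dist(i, j)` stands for the expensive true distance.

```python
import heapq
import math

def approx_set_distance(p, ref_set, feats, dist, sigma=1):
    """Approximate D(p, X_k) using only sigma true distance evaluations."""
    def d_approx(q):
        # Euclidean distance over the k' features computed so far.
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(feats[p], feats[q])))
    # The sigma members of the reference set that look closest in feature space.
    candidates = heapq.nsmallest(sigma, ref_set, key=d_approx)
    # True (expensive) distances are evaluated for these candidates only.
    return min(dist(p, q) for q in candidates)

# Toy usage: objects on a line, with one already computed feature per object.
xs = [0.0, 5.0, 6.0, 20.0]
feats = [[x] for x in xs]
calls = []

def dist(i, j):
    calls.append((i, j))   # record every true distance evaluation
    return abs(xs[i] - xs[j])

value = approx_set_distance(0, [1, 2, 3], feats, dist, sigma=1)
```

With `sigma=1` the true distance is evaluated exactly once per point and reference set, which is where the $O(nk\sigma)$ count comes from; the quality of the result depends on how well the first $k'$ features already rank the candidates.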

There are two questions one might ask about the Cofe embedding: does it do 
well in theory and does it do well in practice. The second question, which is the 
subject of this paper, is addressed in Section 4. As far as the theory goes, Cofe 
does not provide the distortion guarantees of Bourgain. Indeed, we have shown 
that no algorithm which evaluates any proper subset of the values of $d(\cdot, \cdot)$ can
give such a guarantee. However, as noted above, a guarantee of log n distortion 
is not really all that useful, so we turn our attention to the matter of evaluating 
the performance of Cofe on data. 

COFE-GR. Recall FastMap-GR, which uses FastMap to find k features and 
then reorders them using the Greedy Resampling heuristic. Cofe-GR is analo- 
gous: it uses Cofe to find k features and then uses Greedy Resampling to reorder 
the features. We will show that Cofe-GR does a very good job of picking out 
the best features from a Cofe embedding.

4 Experimental Comparison

We compared Cofe, Cofe-GR, FastMap and FastMap-GR in terms of the qual- 
ity of the computed embedding, the execution time for computing the embedding 
and the number of distance evaluations performed during the computation. 

4.1 Protein datasets 

We chose proteins as test data, because the distance function between them is 
very expensive to compute. If proteins are compared at the structure level, just
one distance computation can take minutes or even hours. Comparing proteins
at the sequence level is faster, but still computationally expensive if no
specialized hardware is used. We ran experiments on datasets selected randomly from the
SwissProt protein database [1]. In [7], only very low dimensional data was used, 
such as 3 to 13 dimensional vector data, with the exception of text data. How- 
ever, computing the distance between text documents, once such preprocessing 
as finding word-counts has been completed, is not expensive and thus not as 
relevant to our method. 

The size of the protein datasets we used ranged from 48 to the whole SwissProt
database consisting of 66,745 proteins. The platform we used for these
experiments was a dual-processor 300 MHz Pentium II with 512 MB of memory.
Cofe and Cofe-GR were implemented in C and tested under Linux. We used 
the original FastMap code supplied by Faloutsos and Lin [7]. Due to excessive 
memory requirements we were not able to run FastMap on datasets larger than 
511. Thus, most of our comparisons will be the average of 10 random subsets of 
size 255 from SwissProt. We find the performance of FastMap and Cofe not to 
degrade significantly with size, and so these data are quite representative. 

We used the Smith-Waterman [19] measure for protein similarity, which is
based on a dynamic programming method that takes $O(m_1 m_2)$ time, where $m_1$
and $m_2$ are the lengths of the two protein sequences. Since this is a very expensive
operation, it will significantly affect the running time of the methods that use it.
If $s(a, b)$ is the similarity value between proteins $a$ and $b$, the distance measure
between the two proteins we considered is $d(a, b) = s(a, a) + s(b, b) - 2s(a, b)$, as
proposed by Linial et al. [18].
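
To make the distance measure concrete, here is a small Python sketch. The scoring scheme (match +2, mismatch -1, linear gap -1) is an assumption chosen only for illustration; real protein comparisons typically use a substitution matrix and affine gap penalties, and the experiments reported here did not use this code.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b; O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            score = max(0,
                        prev[j - 1] + (match if ca == cb else mismatch),
                        prev[j] + gap,      # gap in b
                        cur[j - 1] + gap)   # gap in a
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best

def distance(a, b, sim=smith_waterman):
    """d(a, b) = s(a, a) + s(b, b) - 2 s(a, b)."""
    return sim(a, a) + sim(b, b) - 2 * sim(a, b)

d_same = distance("ACGT", "ACGT")
d_diff = distance("AAAA", "CCCC")
```

Identical sequences are at distance 0, and the distance grows as the similarity between the two sequences shrinks relative to their self-similarities. Each call to `distance` costs three similarity computations unless the self-similarities are cached.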

4.2 The Quality of Embeddings 

For the quality of the embedding we used the Stress measure suggested by Falout- 
sos et al. [7] which gives the relative error the embedded distances suffer from 
on average. The figures report results for embeddings of sizes ranging from one 
to the maximum number of features considered. We used t = 2 for FastMap. 

Notice that in Cofe we can select a number of rows ($\alpha$) and columns ($\kappa$) to
evaluate. The number of features extracted will then be $k = \alpha\kappa$. We used the full
Bourgain dimensionality with $\kappa = \alpha = \log n$ and $\sigma = 1$ for Cofe.

We measured the Stress on 4,000 distances between randomly selected pairs 
of elements from the dataset. In our comparisons we present the average over 
results on 10 sets of 255 proteins. 

We begin by comparing Cofe with FastMap and Cofe-GR with FastMap-GR
(Figure 2). Results show that for FastMap the Stress starts at 0.88 for
the first feature extracted and steadily decreases with an increasing number of
features, reaching 0.37 for 49 features, which constitutes a 58% reduction. For the
same range of features, Cofe starts with a much lower Stress value (0.54) and
ends at 0.33, although the Stress of the two methods after 14 features is
basically indistinguishable. Comparing Cofe-GR with FastMap-GR we notice
the quality difference between them. Cofe-GR starts by being 46% better than
FastMap for the first extracted feature and stays around 50% for the first 10
features, after which the gap in quality decreases to almost 0 for 46 features.

Fig. 2. Average Stress of Cofe vs. FastMap and Cofe-GR vs. FastMap-GR on 10 sets
of 255 proteins (with error bars)



The thing to note is that Cofe-GR has its best Stress at 10 dimensions, and
this Stress is not matched by FastMap-GR, even out to 49 dimensions (though if
sufficient dimensions were used, FastMap would probably produce a lower-Stress
embedding). As noted above, it is the quality of an embedding with a small number
of dimensions which is most important, and so we conclude that Cofe-GR
produces very high quality embeddings with a small number of features.

The Effects of Greedy Resampling. Intuitively, the greedy resampled versions
of the algorithms should have lower Stress for any given number of features,
unless, of course, we use all features, in which case the greedy resampled
algorithms will match their non-resampled versions. The comparison of the two plots
in Figure 2 confirms this, although we note that greedy resampling is a heuristic
and that it is possible to come up with pathological embeddings where greedy
resampling does not yield good reduction of dimensionality. A striking aspect
of the results is that Cofe-GR is a significant improvement over Cofe, while
FastMap and FastMap-GR have roughly similar performance. Cofe-GR starts
by improving the quality of the first selected feature by 20% over Cofe and
reaches the minimum Stress of 0.25 for 10 features, whereas FastMap-GR starts
with a 9% improvement over FastMap and reaches the minimum value of 0.31
for 46 features. This comparison shows that Cofe can significantly benefit from
resampling, while FastMap selects the sequence of choices of reference sets very
close to the one picked through greedy resampling on the same reference set.

4.3 The Execution Time

In addition to comparing the quality of the two methods, we were also interested
in comparing their execution time. We instrumented the code to get timing
information and also profiled the execution in order to understand where the
time was spent. Recall that greedy resampling requires neither a significant
number of distance evaluations nor much run time, so we compare only
Cofe with FastMap.

Fig. 3. The average run time and the number of distance evaluations for Cofe vs.
FastMap for 255 proteins

The first plot in Figure 3 presents the average execution time over runs on 10
datasets of size 255. It shows results for an increasing number of features for Cofe
and FastMap. Cofe outperforms FastMap: it is 2.3 times faster for
the first feature and 2.8 times faster for all 49 features.

We profiled both methods to determine where the bottlenecks were in the 
computation. As expected, in both cases, more than 90% of the execution time 
was spent in the computation of the distances between data points (proteins). 
The difference in execution time is explained by the difference in the frequency 
of distance computations between the two methods. However, the run times 
measured require some explanation. We would have expected FastMap to run 
about 33% slower than Cofe since it performs about 4/3 as many distance 
evaluations in theory. However, two effects come into play here. First, FastMap 
performs more scans of the data, and we would expect it to pay a penalty in 
terms of slower memory access. Also, the reference set is known ahead of time in 
Cofe. This allows many optimizations, both in terms of possible parallelization, 
as well as in savings of repeated distance evaluations. Such optimizations are not 
possible for FastMap because the reference sets are only determined dynamically. 
Thus, Figure 3 presents actual measured data of optimized code. 

In Figure 3 we also compare the number of distance calls for a dataset of
255 protein sequences. For a 49-feature embedding, 46% of the total number of
pairwise distances were computed for Cofe and 153% for FastMap. Even for
one feature FastMap computes 3.1% of distances compared to 1.5% for Cofe.
Even if we use the lowest value $t = 1$, which means that one reference point
is completely random, FastMap makes 115% of the total number of pairwise
distance evaluations for 49 features extracted and 2.3% for the first feature.
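
As a rough cross-check of the FastMap percentages, one can plug the parameters into the approximate count of $(t + c)kn$ distance evaluations with $c = 2$ derived in the Appendix and divide by the $\binom{n}{2}$ pairwise distances. The helper below is a hypothetical illustration of that arithmetic, not a measurement; constant-factor savings in the actual implementation are ignored.

```python
from math import comb

def fastmap_fraction(n, k, t, c=2):
    """(t + c) * k * n evaluations as a fraction of all n-choose-2 pairs."""
    return (t + c) * k * n / comb(n, 2)

for t in (2, 1):
    for k in (49, 1):
        print(f"t={t} k={k}: {100 * fastmap_fraction(255, k, t):.1f}%")
```

For $n = 255$ the estimates land within about a percentage point or two of the measured values quoted above, which suggests that the simple count captures FastMap's behavior on this dataset reasonably well.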

4.4 Scalability 

We studied how the Cofe method scales with increasing dataset sizes both in
terms of quality and running time. The first plot in Figure 4 shows the quality
of the first 15 dimensions of Cofe-GR. We note very consistent behavior, with
basically no degradation in the solution over a wide range of dataset sizes. In
Figure 4 we also show the measured run time of Cofe over a variety of dataset
sizes and numbers of features extracted. The flat initial part of the curve is due
to startup costs.




Fig. 4. Quality of Cofe-GR features and run time for Cofe for protein sets of sizes 
ranging from 255 to 8191 



5 Conclusions 

We have proposed two fast algorithms, Cofe and Cofe-GR, which map points
into $k$-dimensional space so that distances are well preserved. We assume only
that we have a distance function between pairs of data-points. Such a setting is
very natural for data types such as biological sequences and multimedia data.
Since distance evaluations are very expensive, reducing the number of distance
evaluations while preserving the quality of our embedding was our main goal.

We compared our Cofe method with FastMap [7] on 10 protein datasets, 
each of size 255. We found that our method performs substantially fewer distance 
evaluations than FastMap. Furthermore, the Stress of the embedding is about 
half that of FastMap’s for a reasonable number of dimensions. We conclude that 
Cofe is an extremely effective solution to the Complex Object Multidimensional 
Scaling problem, exhibiting excellent scalability and yielding high quality
features.

A Analytical Comparison

In this Appendix we compare Cofe, Cofe-GR, FastMap and FastMap-GR in terms 
of the following three criteria: quality, sparsity and locality. 

Quality. As stated in Section 3, Cofe is based on the Bourgain algorithm, which
guarantees a distortion of $O(\log n)$ for any metric, and there are some metrics which
require $\Omega(\log n)$ distortion. It is easy to show that no sparse method can have such a
bound. Consider, for example, the needle-in-a-haystack distance function. All distances
are some large $\Delta$, except, perhaps, for some unknown pair $a, b$ which are at distance
$\epsilon \ll \Delta$ apart. Unless we find the pair $a, b$, we do not know of any unevaluated
distance whether it is $\Delta$ or $\epsilon$. Thus we cannot provide an embedding with
distortion of less than $\Delta/\epsilon$, a quantity which is not even bounded. But this is
not a shortcoming of Cofe; it is a shortcoming of every sparse method, including FastMap.

Similarly, FastMap is based on the Karhunen-Loeve transform, which gives some 
optimality guarantees about picking out the best linear projections. Notice that it gives
no guarantees on finding the best features overall, only the best linear features, under 
some criterion. FastMap is a heuristic and gives no guarantees. We conclude by noting 
that there is no sparse method which gives a guarantee under any criterion. 

Sparsity. Above, we showed that FastMap performs $O(tkn)$ distance evaluations,
where $t$ is a parameter of the algorithm, and Cofe performs $O(\sigma kn)$ evaluations. The
greedy resampling versions of these algorithms do not perform any significant number
of extra distance evaluations.

A comparison of the exact number of distance evaluations performed requires knowing
$t$ and $\sigma$, and analyzing the hidden constants in the $O$ notation. We consider a
dataset with $n$ elements, take the average computational cost of a distance evaluation in
the distance space to be $D$, and take the cost of computing the Euclidean distance in an
$i$-dimensional space to be $d_i = i\,d$, where $d$ is the cost of computing a squared
difference. We consider the number of features extracted to be $k$.

FastMap first chooses a pair of points that define a reference line and then projects
all points in the dataset on that line. At the next step, this procedure is repeated, except
that the distances between points are considered projected in a hyperplane orthogonal
to the chosen reference line. This introduces, in addition to a distance evaluation in
the distance space, an extra $d$ computation for each feature already extracted. The
time to execute FastMap consists of the time $T_{extract}$ to extract the $k$ features and
the time $T_{embed}$ to project the $n$ points onto them. If for each feature extracted, a
number of $t$ passes over the dataset are executed while on each pass the furthest point
to a selected reference point is computed, then
$T_{extract} = tn \sum_{j=1}^{k} \left(D + (j - 1)d\right) = tnk\left(D + \frac{k-1}{2}d\right)$,
and if $c$ distance computations are needed to project a point onto a reference line,
then $T_{embed} = cn \sum_{j=1}^{k} \left(D + (j - 1)d\right) = cnk\left(D + \frac{k-1}{2}d\right)$.
The number of distance computations $c$ for
projecting a point on a line is 2, considering that the distance between the reference
points is known. Thus: $T_{FastMap} = T_{extract} + T_{embed} = nk(t + c)\left(D + \frac{k-1}{2}d\right)$.

Computing the Bourgain transform would require randomly selecting the reference
sets and then computing the embedding of a point in the image space as the smallest
distance between the point and the points in the reference sets. Selecting the reference
sets does not require any computation, and therefore for an embedding with $\kappa$ columns
and $\alpha$ rows: $T_{Bourgain} = 2n\kappa(2^{\alpha} - 1)D$.

Instead of the $\alpha$ rows Bourgain would compute, Cofe computes only $\beta$ rows using
the distance in the distance space. For the remaining $\alpha - \beta$ rows, the coordinate for the
corresponding dimension $i$ is approximated by first selecting the $\sigma$ closest points in
the already computed $(i - 1)$-dimensional image space, then computing the smallest
distance in the distance space among the $\sigma$ points.

Extending the $\beta$ row embedding to an $\alpha$ row embedding therefore takes
$T_{extension} = n\kappa\sigma(\alpha - \beta)D$ plus the cost, in units of $d$, of the
approximate distance computations in the image space. Approximating $j < \kappa i$ we
get: $T_{Cofe} < n\kappa\left(2^{\beta+1} - 2 + \sigma(\alpha - \beta)\right)D + n\kappa\left((\alpha - 1)2^{\alpha+1} - \beta 2^{\beta+1} + 4\right)d$.

Because computing $D$ is by far more expensive than $d$, the dominant terms for the
two methods are: $T_{FastMap} = n\kappa\alpha(t + c)D$ and
$T_{Cofe} = n\kappa\left(2^{\beta+1} + \sigma(\alpha - \beta)\right)D$, where
the size of the embedding is $k = \kappa\alpha$.

For $\alpha = \log n$ and $\beta = \log\log n$:
$T_{FastMap} = n\kappa(t + c)\log n\, D$ and
$T_{Cofe} = n\kappa\left((\sigma + 2)\log n - \sigma\log\log n\right)D$. For this choice
of parameters, both methods compute $O(\log n)$ real distances. Considering $c = 2$ and
$k = \kappa\log n$, FastMap performs more or less $(t + 2)kn$ distance evaluations and Cofe
performs approximately $(\sigma + 2)kn$ distance evaluations. The values for $t$ that would
give comparable numbers of distance computations in the distance space are $t = \sigma$ or
$t = \sigma + 1$.

Faloutsos and Lin suggested that $t = 7$, though experiments showed that on the
data used, very little advantage is gained by setting $t > 2$. We found that setting
$\sigma = 1$ gave very reasonable results, so we set $t = 2$ and conclude that, as $n$
increases, FastMap will perform $\frac{t + 2}{\sigma + 2} = \frac{4}{3}$ times as many
distance evaluations.
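
The substitution behind this ratio can be written out step by step. The derivation below is a sketch that assumes base-2 logarithms, $c = 2$, and the dominant-term expressions $T_{FastMap} = n\kappa\alpha(t + c)D$ and $T_{Cofe} = n\kappa(2^{\beta+1} + \sigma(\alpha - \beta))D$ from this Appendix.

```latex
\begin{align*}
T_{Cofe} &\approx n\kappa\left(2^{\beta+1} + \sigma(\alpha-\beta)\right)D \\
 &= n\kappa\left(2\cdot 2^{\log\log n} + \sigma(\log n - \log\log n)\right)D
   && (\alpha = \log n,\ \beta = \log\log n) \\
 &= n\kappa\left((\sigma+2)\log n - \sigma\log\log n\right)D, \\
\frac{T_{FastMap}}{T_{Cofe}}
 &\approx \frac{(t+2)\log n}{(\sigma+2)\log n - \sigma\log\log n}
 \;\longrightarrow\; \frac{t+2}{\sigma+2} = \frac{4}{3}
   && (t = 2,\ \sigma = 1,\ n \to \infty).
\end{align*}
```

For finite $n$ the $\sigma\log\log n$ term makes the ratio slightly larger than $4/3$; it approaches $4/3$ from above because $\log\log n / \log n \to 0$.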

Locality. In order to select one feature, FastMap has to scan the data $t$ times and then
compute the embedding (projection) using one more scan. For extracting $k$ features,
the number of scans is: $S_{FastMap} = (t + 1)k$.

Cofe randomly selects the reference sets that correspond to features. We distin- 
guish two cases: either the reference sets fit in memory or they don’t. Consider the 
second case in which not all reference sets fit in memory. The size of the reference 
sets grows exponentially with the row index of the feature. The smallest sets can be 
considered to fit in memory. Since the only information needed to compute the embed- 
ding corresponding to the bootstrapping features are the points in the reference sets, 
two scans, one to load them and one to compute the embedding for each point in the 
dataset, are needed. To compute the embedding of a point for each extension feature, 
the embedding for the previous features for that point and for the reference set points 
are needed. This can be achieved by loading a few reference sets during a scan and 
then computing their embedding, followed by another scan to compute the embedding 
of the rest of the points in the dataset. The reference sets for the next feature can also 
be prefetched during the computation of the embedding for the current feature. The 
worst case number of scans is therefore: $S_{Cofe} = k + 1$.

When all points in the reference sets fit in memory, that is, when we compute 
sufficiently few features so that we need not generate very large reference sets, a scan 
for loading the reference set and the computation of their embedding followed by a scan 
during which the whole embedding for each point in the dataset is computed suffices. 
Only 2 scans are required in this case. 

In the worst case and setting $t = 2$, FastMap performs 3 times as many scans.

Abstract. Numerous data mining algorithms rely heavily on similarity
queries. Although many or even all of the performed queries do not 
depend on each other, the algorithms process them in a sequential way. 
Recently, a novel technique for efficiently processing multiple similarity 
queries issued simultaneously has been introduced. It was shown that 
multiple similarity queries substantially speed up query intensive data
mining applications. For the important case of multiple k-nearest
neighbor queries on top of a multidimensional index structure the
problem of scheduling directory pages and data pages arises. This aspect 
has not been addressed so far. In this paper, we derive the theoretic 
foundation of this scheduling problem. Additionally, we propose several 
scheduling algorithms based on our theoretical results. In our experimen- 
tal evaluation, we show that considering the maximum priority of pages 
clearly outperforms other scheduling approaches. 

1. Introduction 

Data mining is a core information technology and has been defined as the major step in 
the process of Knowledge Discovery in Databases [9]. Many data mining algorithms are 
query intensive, i.e. numerous queries are initiated on the underlying database system. 
In the prominent case of multidimensional data spaces, similarity queries, particularly k- 
nearest neighbor (k-nn) queries, are the most important query type [17]. The process of 
finding the k-nearest neighbors of a given query object is a CPU and I/O intensive task 
and the conventional approach to address this problem is to use some multidimensional 
index structure [16], [10]. 

While several sophisticated solutions have been proposed for single k-nn queries, the 
problem of efficiently processing multiple k-nn queries issued simultaneously is rela- 
tively untouched. However, there are many applications where k-nn queries emerge 
simultaneously. The proximity analysis algorithm proposed in [13], for instance, per- 
forms a k-nn query for each of the “top-k” neighbor objects of any identified cluster in 
the database. The outlier identification algorithm proposed in [6] performs a k-nn query 
with a high value of k for each database object. Another example is the classification of 
a set of objects which can efficiently be done by using a k-nn query for each unclassified 
object [14]. In all these algorithms k-nn queries are processed sequentially, i.e. one after 
another. However, since many or even all queries do not depend on each other, they can 
easily be performed simultaneously which offers much more potential for query optimi- 
zation. In [2], a novel technique for simultaneous query processing called multiple sim- 
ilarity queries has been introduced. The authors present a syntactical transformation of 
a large class of data mining algorithms into a form which uses multiple similarity que- 
ries. For the efficient processing of the transformed algorithms, the authors propose to 
load a page once and immediately process it for all queries which consider this page as 
a candidate page. It has been shown that this results in a significant reduction of the total 
I/O cost. By applying the triangle inequality in order to avoid expensive distance calculations, 
further substantial performance improvements can be achieved. If the underlying 
access method is scan-based, this scheme is straightforward. However, when performing 
multiple k-nn queries on top of a multidimensional index structure, the important 
question arises in which order the pages should be loaded. This problem 
has not been addressed in [2]. In this paper, we study the theoretical background of this 
scheduling problem. In particular, by developing a stochastic model, we find the expect- 
ed distance to a nearest neighbor candidate located in a considered page to be the key 
information in order to solve the scheduling problem. Then we propose and evaluate 
several scheduling techniques which are based on our theoretical results. 

The rest of this paper is organized as follows. In section 2, we describe the general 
processing of multiple k-nn queries. In section 3, we study the theoretical background of 
the scheduling problem, and in section 4, we propose several scheduling techniques. 
The experimental evaluation is presented in section 5, and section 6 concludes the paper. 

2. Multiple k-Nearest Neighbor Queries 

We briefly describe the algorithm for processing multiple k-nn queries (cf. Fig. 1), which 
is based on the HS single k-nn algorithm proposed in [12]. The algorithm starts with m 
query specifications, each consisting of a query point q_i and an integer k_i. For each query, 
an active page list (APL) is created. An APL is a priority queue containing the index 
pages in ascending order of the minimum distance MINDIST between the corresponding 
page regions and the query point. In contrast to single query processing, the multiple 
k-nn algorithm maintains not one APL at a time but m APLs simultaneously. While at 
least one k-nn query is running, the algorithm iteratively chooses a page P, and each 
query q_i which has not pruned P from its APL_i (i.e. P is still enlisted in APL_i or P was 
not yet encountered) processes this page immediately. Thus, we only have one loading 
operation even if all queries process P. When processing a page P, the algorithm additionally 
saves valuable CPU time by avoiding distance calculations, which are the most 
expensive CPU operations in the context of k-nn search. The method process(P, q_j) does 
not directly calculate the distances d(q_j, o) between the objects o located on P and a query 
object q_j; instead, it first tries to disqualify as many objects as possible by applying 
the triangle inequality in the following way: if 

\[ |d(q_i, q_j) - d(q_i, o)| > \mathit{knn\_dist}(q_j), \quad i < j, \]

holds, then d(q_j, o) > knn_dist(q_j) is true and the distance d(q_j, o) need not be 
calculated. The inter-query distances d(q_i, q_j) are precalculated, and knn_dist(q_j) 
denotes the k-nn distance of query q_j determined so far. For all objects which cannot be 
disqualified, the distance to q_j is computed and stored in the buffer dist_buffer. Note 
that disqualified distances cannot be used to avoid distance computations of other query 
objects. Obviously, the method choose_candidate_page() is crucial for the overall 
performance of the algorithm, since a bad choice results in unnecessary and expensive 
processing of pages. Thus, it is important to find a schedule which minimizes the total 
number of processing operations. 
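
The disqualification test based on the triangle inequality can be illustrated with a minimal Python sketch. The function name can_skip_distance and the sample points are hypothetical and only serve as an illustration; the sketch assumes a Euclidean distance, although any metric would do.

```python
import math


def can_skip_distance(d_qi_qj, d_qi_o, knn_dist_qj):
    """Return True if d(q_j, o) is guaranteed to exceed knn_dist(q_j).

    By the triangle inequality, d(q_j, o) >= |d(q_i, q_j) - d(q_i, o)|.
    If this lower bound already exceeds the current k-nn distance of q_j,
    the object o cannot improve the result of q_j.
    """
    return abs(d_qi_qj - d_qi_o) > knn_dist_qj


# Hypothetical sample data: two query points and one database object.
q_i, q_j, o = (0.0, 0.0), (10.0, 0.0), (1.0, 0.0)

d_qi_qj = math.dist(q_i, q_j)  # precalculated inter-query distance
d_qi_o = math.dist(q_i, o)     # already computed while processing q_i

# Lower bound |10 - 1| = 9 exceeds the current k-nn distance 5 of q_j,
# so the exact distance d(q_j, o) does not have to be computed.
skip = can_skip_distance(d_qi_qj, d_qi_o, knn_dist_qj=5.0)
print(skip)
```

The test is only a sufficient condition: when it fails, the exact distance still has to be computed and stored in the distance buffer.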
DB::multiple_knn_queries(q_1, ..., q_m: query_object, 
                         k_1, ..., k_m: integer) 
begin 
  precalculate_interquery_distances([q_1, ..., q_m]); 
  for i from 1 to m do create_APL(q_i); 
  while queries_running([q_1, ..., q_m]) do 
    P = choose_candidate_page([APL_1, ..., APL_m]); 
    for i from 1 to m do APL_i.delete(P); 
    load(P); 
    initialize_dist_buffer(); 
    for i from 1 to m do 
      if isa_candidate_page(P, q_i) then 
        process(P, q_i, ...); 
  return ([kNN_1, ..., kNN_m]); 
end. 

DB::process(page P, query_object q_i, ...) 
begin 
  foreach object o in P do 
    if not avoid_dist_computation(o, q_i, dist_buffer) then 
      dist_buffer[o][q_i] = distance(o, q_i); 
      if dist_buffer[o][q_i] < knn_dist(q_i) then 
        if isa_directory_page(P) then APL_i.insert(o); 
        else (* P is a data page *) 
          kNN_i.insert(o); 
          if kNN_i.cardinality() > k_i then 
            kNN_i.remove_last_object(); 
          APL_i.prune_pages(); 
  return; 
end. 

Fig. 1. Multiple k-nearest neighbor algorithm 



3. The Pruning Power Theory 

The HS algorithm for processing single k-nn queries loads exactly those pages which are 
intersected by the k-nn sphere in ascending order of the MINDIST. In its basic form, the 
HS algorithm is heavily I/O bound. This behavior changes if we move from processing 
single k-nn queries to processing multiple k-nn queries. To explain this fact, we 
have to consider the loading operation load(P) and the processing operation 
process(P, q_i). When P is a data page, the processing operation determines all distances 
between the query point and the data points on page P whose computation cannot be 
avoided by applying the triangle inequality (cf. section 2). In the worst case, the total 
number of distance computations to be performed for a page P with capacity C_eff is 
m · C_eff, i.e. no distance computation could be avoided. The cost for loading P, on the 
other hand, is independent of the number of queries which actually process this page and 
therefore remains constant. As a consequence, the algorithm may switch from I/O bound 
to CPU bound. This switch, however, does not only depend on the number m of 
simultaneous queries but also on how early the algorithm is able to exclude pages from 
processing. 

3.1 Problem Description 

Whenever the distance of the k-nn candidate for some query q_i (1 ≤ i ≤ m) decreases, all 
pages P_j having a MINDIST larger than the new k-nn candidate distance are excluded 
from being processed for this query (they are pruned off APL_i). If a page is pruned for 
all but a few queries, the effort of the loading operation must still be spent, whereas 
many processing operations are avoided in advance. Therefore, pruning saves valuable 
processing time and the algorithm switches back to I/O bound. 

Fig. 2. Example 

The central question we have to address is: what is a suitable order of pages that 
prunes as many processing operations process(P_j, q_i) as possible? To show that this 
is not a trivial task, let us consider the example in Fig. 2. Applying the HS 
algorithm for 1-nn queries, query q_1 would first access page P_1 and then P_3. Query q_2 
would access the page sequence P_2, P_1, and P_3. Finally, q_3 would only access P_3 and 
then stop immediately. In multiple query processing, all three pages must be loaded into 
main memory, because each page is relevant for at least one query. The loading 
sequence, however, makes a big difference with respect to the processing operations. 
For instance, a very bad choice is to start with P_2. After loading P_2, we have to perform 
all distance calculations between the points on P_2 and all query points, since using the 
triangle inequality cannot avoid any distance calculation. Additionally, the data points 
found on page P_2 do not exclude any of the remaining pages P_1 and P_3 from any query, 
because the current 1-nn candidates still have a high distance from the query points. 
Therefore, the pruning distances are bad, or in other words, P_2 does not yield a high 
pruning power. A much better choice is P_3, since once P_3 is loaded and processed, P_1 and 
P_2 are pruned for query q_3, because the pruning distance of q_3 becomes fairly low. If P_1 
is loaded next, P_2 can additionally be pruned for q_1. Finally, P_2 is loaded, but only the 
distances to q_2 must be determined since P_2 is pruned for all other queries. Thus, we 
have saved 3 out of 9 processing operations compared to a scheduling sequence starting 
with page P_2. 

3.2 The Pruning Power Definition 

This simple example already demonstrates that a good page schedule is essential in 
order to further improve the processing of multiple k-nn queries. Thus, our general 
objective is to load pages which have a high pruning power. We define the pruning 
power as follows: 

Definition 1: Pruning Power 

The pruning power of a page P_r is the number of processing operations process(P_s, q_i), 
1 ≤ s ≤ num_pages, s ≠ r, 1 ≤ i ≤ m, which can be avoided if page P_r is loaded and processed. 

According to this definition, the pruning power of a page is exactly known only at the 
time when the page has been processed. Recall that processing operations can be avoided 
only if the k-nn candidate (and, thus, the pruning distance) for some query changes. 
In general, it is not possible to determine in advance whether such a change occurs. 
Therefore, our goal is first to capture more precisely the conditions under which a page 
has a high pruning power, and then to develop heuristics which select pages that obey 
these conditions. The core problem is to determine an expected value for the 
radius for which the intersection volume contains a number k' of points (eq. (1)), 
where k' is the number of points which are needed to have a set of k points in total. A 
difficult task here is to determine the intersection volume, which can be solved by a 
tabulated Monte Carlo integration [7]. Although tabulation makes this numerical method 
possible, the performance is not high enough, since eq. (1) must be evaluated often. 

A promising approach to address such problems is to introduce heuristics which in- 
herently follow the optimal solution. When considering single k-nn algorithms proposed 
in the literature, e.g. [12], [15], we can extract two substantial measures which are asso- 
ciated with the pruning power: the minimum distance MINDIST and the priority of a 
page in the corresponding APL. Both measures have been successfully used to schedule 
pages for a single k-nn query. Additionally, it has been shown that other measures, e.g. 
the minimum of the maximum possible distances from a query point to a face of a 
considered page region, perform only poorly. Thus, we develop three heuristics in the 
following section which select pages with a high pruning power on the basis of these two 
measures, i.e., using the priority of pages in APLs and using the distance MINDIST. 



4. Pruning Power Scheduling Strategies 



4.1 Average Priority Technique 

This heuristic is based on the observation that a page can only have a high pruning power 
if it has a high local priority in several APLs. If the page has a low priority in most of the 
APLs, it is very unlikely that this page yields a high pruning power. When 
selecting the next page to process, for each page P_j the ranking numbers RN(APL_i, P_j) 
in all queues APL_i are determined and summed up. If, for instance, P_j is at the first 
position for q_1, at the fifth position for q_2 and at the 13th position for q_3, we get a 
cumulated ranking number of 19 for P_j. This number is divided by the number (3 in this 
example) of APLs that contain P_j in order to form an average and to avoid the 
underestimation of pages that are already pruned for many queries. The page P with the 
lowest average ranking is selected to be loaded and processed next. With npq_j we denote 
the number of priority queues which contain page P_j. 
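
The selection rule of the average priority technique can be sketched in Python as follows. This is only an illustration under simplifying assumptions: each APL is modeled as a plain list of page identifiers in ascending MINDIST order, a pruned page is simply absent from the list, and the function name average_priority_choice as well as the page names are hypothetical.

```python
def average_priority_choice(apls):
    """Return (chosen page, average ranks) for a list of APLs.

    Each APL is a list of page ids sorted by ascending MINDIST, so the
    ranking number RN of a page is its 1-based position in the list.
    """
    rank_sum = {}  # cumulated ranking number per page
    npq = {}       # number of priority queues containing the page
    for apl in apls:
        for pos, page in enumerate(apl, start=1):
            rank_sum[page] = rank_sum.get(page, 0) + pos
            npq[page] = npq.get(page, 0) + 1
    avg = {page: rank_sum[page] / npq[page] for page in rank_sum}
    # The page with the lowest average ranking is loaded next.
    return min(avg, key=avg.get), avg


# Small example: page B is ranked 2, 1 and 1, page A is ranked 1 and 2.
best, avg = average_priority_choice([["A", "B"], ["B", "A"], ["B"]])
print(best, avg)

# The example from the text: positions 1, 5 and 13 give 19 / 3.
apls = [
    ["X"],
    ["a1", "a2", "a3", "a4", "X"],
    [f"b{i}" for i in range(12)] + ["X"],
]
_, avg_text = average_priority_choice(apls)
print(avg_text["X"])
```

Dividing by the number of APLs that still contain the page, rather than by m, is what keeps pages that are already pruned for many queries from being underestimated.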
4.2 Average Distance Technique 

The average distance strategy is similar to the average priority heuristic. The focus of 
this approach is not the position of a page in the APLs, but directly the minimum distance 
MINDIST(q_i, P_j) between a page region P_j and a query point q_i. The motivation is the 
observation that the pruning power of a page decreases monotonically with increasing 
distance to the query point, i.e. when a page is located far away from most query 
points, this page probably has a low pruning power. For each page, the distances to all 
query points (except those for which the considered page is already pruned) are determined 
and summed up. Again, the average is formed by dividing the cumulated distance 
by the number of all queries that did not yet prune the page. The page P with the least 
average distance is selected for being processed next: 

\[ P = \mathrm{MIN}\left( \frac{1}{npq_1} \sum_{i=1}^{m} \mathrm{MINDIST}(q_i, P_1),\ \ldots,\ \frac{1}{npq_{num\_pages}} \sum_{i=1}^{m} \mathrm{MINDIST}(q_i, P_{num\_pages}) \right) \qquad (3) \]

where each sum only includes the queries q_i that have not yet pruned the respective page. 
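
A minimal Python sketch of the average distance selection is given below. It assumes that the MINDIST values are available in a nested dictionary that only contains the queries which have not yet pruned a page; this data layout, the function name average_distance_choice and the sample values are hypothetical.

```python
def average_distance_choice(mindist):
    """Return (chosen page, average distances).

    mindist maps a page id to a dict {query id: MINDIST}. Queries that
    have already pruned the page are assumed to be absent from the dict.
    """
    avg = {
        page: sum(dists.values()) / len(dists)
        for page, dists in mindist.items()
        if dists  # skip pages that are pruned for all queries
    }
    # The page with the least average distance is processed next.
    return min(avg, key=avg.get), avg


# Hypothetical MINDIST values for three pages and three queries.
mindist = {
    "P1": {"q1": 0.5, "q2": 2.0, "q3": 1.0},
    "P2": {"q2": 0.4},                # already pruned for q1 and q3
    "P3": {"q1": 1.5, "q2": 3.0},     # already pruned for q3
}
best_page, avg_dist = average_distance_choice(mindist)
print(best_page, avg_dist)
```

In this example the page P2 is selected although only one query still needs it, which shows how averaging over the remaining queries avoids penalizing pages that are already pruned elsewhere.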

4.3 Maximum Priority Technique 

The motivation for this strategy is the observation that pages with maximum priority 
(i.e. they are at the top position) in one or more priority queues often have a high pruning 
power even if they have very low priority for other queries. The reason is that pages with 
maximum priority generally are more likely to contain one of the k nearest neighbors 
than the pages on the following positions. Therefore, a page which is at the first position 
for few queries may outperform a page that is at the second position for many queries. 
Like the average priority technique, the maximum priority technique determines the 
ranking of all pages with respect to the position in the priority queues. In contrast to the 
average priority technique, it counts the number MPC(APL_1, ..., APL_m, P_j) of queries 
q_i for which the page P_j has the maximum priority. The page P yielding the highest count 
is selected as the next page to be processed. When two or more pages have an equal 
count, we consider position-two (and subsequent) counts as a secondary criterion. 

\[ P = \mathrm{MAX}\big( \mathrm{MPC}(APL_1, \ldots, APL_m, P_1),\ \ldots,\ \mathrm{MPC}(APL_1, \ldots, APL_m, P_{num\_pages}) \big) \qquad (4) \]
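
The maximum priority selection, including the secondary criterion for ties, can be sketched in Python as follows. The sketch again models each APL as a list of page identifiers in ascending MINDIST order and assumes that at least one APL is non-empty; the function name max_priority_choice and the page names are hypothetical.

```python
def max_priority_choice(apls):
    """Return the page that is at the top of the most APLs.

    Ties are broken by the number of APLs in which a page is at the
    second position, then at the third position, and so on.
    """
    depth = max((len(apl) for apl in apls), default=0)
    counts = {}  # page -> [count at position 1, count at position 2, ...]
    for apl in apls:
        for pos, page in enumerate(apl):
            counts.setdefault(page, [0] * depth)[pos] += 1
    # Python compares lists lexicographically, which matches the
    # "top count first, then position-two count, ..." criterion.
    return max(counts, key=lambda page: counts[page])


# Page A is at the top of two APLs, B and C only of one each.
first = max_priority_choice([["A", "B"], ["A", "C"], ["B", "A"], ["C", "B"]])

# A and B are both at the top once; B additionally holds one second position.
tie = max_priority_choice([["A", "B"], ["B", "C"]])
print(first, tie)
```

Only the leading positions of each APL influence the decision in most cases, which is consistent with the low scheduling overhead reported for this technique in the experiments.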

5. Experimental Evaluation 

In order to determine the most efficient scheduling technique we performed an extensive 
experimental evaluation using the following databases: 

• Synthetic database: 1,600,000 8-d random points following a uniform distribution. 

• CAD database: 16-d Fourier points corresponding to contours of 1,280,000 industrial 
parts used in the S3-system [4]. 

• Astronomy database: 20-d feature vectors of 1,000,000 stars and galaxies which are 
part of the so-called Tycho catalogue [11]. 

All experiments presented in this section were performed on an Intel Pentium II-300 
workstation under Linux 6.0. The index structure we used was a variant of the X-tree 
where the directory consists of one large supernode. We used a block size of 32 KBytes 
for the X-tree and the cache size was set to 10% of the X-tree size. 

Fig. 3. Query size on the CAD database        Fig. 5. Query size on the synthetic database 

For index structures with a hierarchically organized directory, the proposed scheduling 
techniques can be applied as dynamic approaches or as hybrid approaches (sequences 
of static schedules). However, with the directory consisting of one large supernode, 
we are able to apply the techniques in a purely static way: since we can completely 
construct the APL once a query point is provided, we can also determine a page schedule 
before we start query processing. While this is straightforward for the average priority 
and the average distance technique, we have to slightly modify the maximum priority 
technique. First, we determine all pages having a maximum priority and sort them with 
respect to their counts. Assuming that those pages are already processed, we consider all 
pages located on the second position of any APL as pages with maximum priority and 
again sort them. This step iterates until all pages are enlisted in the schedule. Addition- 
ally, for all static approaches we check if a chosen page is still needed by any query 
before loading the page. We experimentally evaluate the following page scheduling 
techniques (cf. section 4): 

• Static and Dynamic Average Priority (denoted as SAvgPos and DAvgPos) 

• Static and Dynamic Average Distance (denoted as SAvgDist and DAvgDist) 

• Static and Dynamic Maximum Priority (denoted as SMaxPrio and DMaxPrio) 
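
The static variant of the maximum priority technique described before the list of evaluated techniques can be sketched in Python as follows. This is one possible reading of the description: pages are ranked position by position, and within one position they are sorted by the number of APLs in which they occupy that position. The function name static_max_priority_schedule and the page names are hypothetical, and the check whether a page is still needed before loading is left out.

```python
from collections import Counter


def static_max_priority_schedule(apls):
    """Build a complete page schedule before query processing starts.

    apls is a list of fully constructed APLs, each a list of page ids in
    ascending MINDIST order.
    """
    schedule, scheduled = [], set()
    depth = max((len(apl) for apl in apls), default=0)
    for pos in range(depth):
        # Count, for every page not yet scheduled, in how many APLs it
        # is located at the current position.
        counts = Counter(
            apl[pos]
            for apl in apls
            if len(apl) > pos and apl[pos] not in scheduled
        )
        # most_common() sorts by descending count.
        for page, _ in counts.most_common():
            schedule.append(page)
            scheduled.add(page)
    return schedule


schedule = static_max_priority_schedule(
    [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "D"]]
)
print(schedule)
```

Because the schedule is fixed in advance, it cannot react to pruning information gained while processing pages, which is the drawback of the static approaches observed in the experiments.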

We first investigate the effect of the query size using the CAD database. The maximum 
number m of multiple k-nn queries in the system is set to 20 and we make the simplifying 
assumption that at any time there are enough queries in a queue waiting to be processed 
to load the system completely. The query size k varies, and Fig. 3 depicts the average 
query cost including the CPU and I/O cost. We can observe that DMaxPrio clearly 
outperforms all other scheduling techniques for all values of k. Compared to the second best 
technique DAvgDist, the DMaxPrio approach exhibits 85% of the average query cost for 
k = 1 and 83% of the average query cost for k = 100. Considering the SAvgPos and the 
SAvgDist approaches (their performance plots are almost identical), the average query 
cost of DMaxPrio is only 32% (70%) of the corresponding average query cost for 
1- (100-) nearest neighbor queries. All dynamic techniques outperform the static 
approaches up to k = 50. For k > 50, SMaxPrio starts to outperform the DAvgPos 
approach. With increasing k, the performance gain of DMaxPrio compared to the other 
techniques decreases. The reason is that with increasing query size, the distance to the 
k-nn (pruning distance) increases and fewer distance calculations can be avoided by 
applying the triangle inequality. However, even for high values of k the DMaxPrio 
approach saves 12% - 30% of the query cost compared to the other techniques. 
Fig. 6. Database size on the CAD database        Fig. 7. Number of multiple queries (Astronomy database) 

The same experiment is performed on the synthetic database (cf. Fig. 5). As before, 
the DMaxPrio approach yields the best overall performance and leads to an average 
query cost of 36% - 59% for k = 1 and 7% - 34% for k = 100 of the average query cost 
of the other scheduling techniques except for DAvgDist. In this experiment, the 
DAvgDist approach shows performance comparable to DMaxPrio and even outperforms 
DMaxPrio for k = 50. For all values of k, the dynamic techniques provide a much 
better performance than the static techniques. Considering only the static approaches, 
we observe that SMaxPrio outperforms SAvgPos and SAvgDist. 

Next, we analyze the impact of the database size on the scheduling techniques. We 
used the CAD database and increased the number of Fourier points from 12,800 up to 
1,280,000. We kept the maximum number m of multiple k-nn queries at 20 and performed 
10-nn queries (cf. Fig. 6). For small database sizes, all scheduling techniques 
show similar performance. When increasing the database size, this situation changes 
drastically: for database sizes larger than 16 MBytes, the performance of SAvgPos and 
SAvgDist degenerates, whereas DMaxPrio and DAvgDist show a comparatively moderate 
increase of the average query cost (DMaxPrio again outperforms DAvgDist). This 
can be explained by the following observation: with increasing database size, the average 
length of the APLs also increases and the static approaches suffer more and more 
from the lack of information resulting from processing candidate pages. The average 
query cost of SMaxPrio shows an acceptable increase up to 48 MBytes. For 80 MBytes, 
however, this approach also exhibits poor performance. Considering the dynamic scheduling 
techniques, DAvgPos has the worst performance for large database sizes. 

Since dynamic approaches generally introduce some computational overhead for determining 
the page schedule, we also investigated the system parameter m, which typically 
depends on hardware aspects and on the underlying application. We used the astronomy 
database and performed 20-nn queries while increasing the maximum number 
m of multiple queries (cf. Fig. 7). For all scheduling techniques, the average query cost 
clearly decreases with increasing m, which underlines the effectiveness of the multiple 
query approach. Again, DMaxPrio yields the best overall performance (it outperforms 
all other techniques for m > 5). An important result is the following observation: while 
for small values of m the dynamic approaches obviously outperform the static approaches, 
we observe that SAvgPos and SAvgDist outperform DAvgPos and DAvgDist for 
m = 60. The reason for this result is the increasing cost of dynamically calculating the 
scheduling criterion, since the number of APLs that must be analyzed is directly proportional 
to the number of queries in the system. The DMaxPrio approach, on the other 
hand, exhibits excellent performance even for high values of m due to the fact that the 
decision criterion can be evaluated at almost no extra cost, since on average only the top 
elements of each APL have to be analyzed. 

Fig. 8. Effectiveness of the multiple query approach on the CAD database (x-axis: query size k) 

The objective of our last experiment is to show the efficiency of our approach in general. 
We compared the most efficient pruning power scheduling technique (DMaxPrio) with 
the multiple queries technique using an efficient variant of the linear scan, namely the 
VA-file [17], and we compared it with conventional query processing using the X-tree. For this 
experiment, we used the CAD database, set m to 20 and varied the query parameter k from 
1 to 100. The average query cost with respect to k is depicted in Fig. 8. Our new scheduling 
technique clearly outperforms the conventional X-tree query processing by speed-up 
factors ranging from 2.72 for 1-nn queries to 1.78 for 100-nn queries. Considering the VA-file 
using the multiple query scheme (VA-file mult. queries), we can observe that the average 
query cost of DMaxPrio is less than the average query cost of the multiple query VA-file 
for all values of k. However, this does not hold for all scheduling techniques. For instance, 
using DAvgPos or a static approach (e.g. SMaxPrio) for the page scheduling, the multiple 
query VA-file outperforms the multiple query X-tree already for k > 10 (compare with 
Fig. 3). This result underlines the importance of finding an efficient and robust scheduling 
technique in order to maximize the performance improvement resulting from the multiple 
query scheme. 

6. Conclusions 

In this paper, we have studied the problem of page scheduling for multiple k-nn query 
processing prevalent in data mining applications such as proximity analysis, outlier 
identification or nearest neighbor classification. We have derived the theoretical foundation and found 
that the pruning power of a page is the key information in order to solve the scheduling 
problem. The pruning power results from the distance between the k-nn candidates located 
in a data page and the query points. We have proposed several scheduling algorithms which 
are based on the pruning power theory. An extensive experimental evaluation demonstrates the 
practical impact of our technique. For future work, we plan to analyze our technique in 
parallel and distributed environments. 
Abstract. The most popular data mining techniques consist in searching databases 
for frequently occurring patterns, e.g. association rules and sequential patterns. 
We argue that, in contrast to today's loosely-coupled tools, data mining 
should be regarded as advanced database querying and supported by Database 
Management Systems (DBMSs). In this paper we describe our research prototype 
system, which logically extends DBMS functionality, offering extensive 
support for pattern discovery, storage and management. We focus on the system 
architecture and a novel SQL-based data mining query language, which serves as 
the user interface to the system. 



1 Introduction 

The primary goal of data mining is to discover frequently occurring, previously unknown, 
and interesting patterns from large databases [8]. The discovered patterns are 
usually represented in the form of association rules or sequential patterns. The results 
of data mining are mostly used to support decisions, observe trends, and plan marketing 
strategies. For example, the association rule "A & F -> D" states that the purchase 
of the products A and F is often associated with the purchase of the product D. 

From the conceptual point of view, data mining can be perceived as advanced data- 
base querying, since the resulting information in fact exists in the database, but it is 
difficult to retrieve it. For this reason, there is a very promising idea of integrating 
data mining methods with DBMSs [14][17], where users specify their problems by 
means of data mining queries. This leads to the concept of on-line data mining, fully 
supported by the DBMS architecture (similarly to OLAP). 

We have built a research prototype system, called RD2, which logically extends 
DBMS functionality and allows users to mine relational databases. Users or user 
applications communicate with RD2 by means of the MineSQL language, which is used to express 
data mining problems. MineSQL is a multipurpose, declarative language, based on 
SQL, for discovering, storing and managing association rules and sequential patterns. 
The novel ideas of data mining views and materialized views are also implemented in 
RD2 and supported by MineSQL. In this paper we focus on the MineSQL language and its 
usage in the area of data mining. 
1.1 Basic Definitions 

Association rules 

Let L = {l_1, l_2, ..., l_m} be a set of literals, called items. Let a non-empty set of items T be 
called an itemset. Let D be a set of variable length itemsets, where each itemset T ⊆ L. 
We say that an itemset T supports an item x ∈ L if x is in T. We say that an itemset T 
supports an itemset X ⊆ L if T supports every item in the set X. 

An association rule is an implication of the form X → Y, where X ⊂ L, Y ⊂ L, X ∩ Y = ∅. 
Each rule has associated measures of its statistical significance and strength, called 
support and confidence. The rule X → Y holds in the set D with support s if 100·s% of 
itemsets in D support X ∪ Y. The rule X → Y has confidence c if 100·c% of itemsets in 
D that support X also support Y. 
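
The two measures can be illustrated with a short Python sketch. The sample itemsets are made up for illustration, and the function names support and confidence are hypothetical; both functions return fractions between 0 and 1 rather than percentages.

```python
def support(D, X):
    """Fraction of itemsets in D that support every item of X."""
    X = set(X)
    return sum(1 for T in D if X <= set(T)) / len(D)


def confidence(D, X, Y):
    """Fraction of the itemsets supporting X that also support Y."""
    X, Y = set(X), set(Y)
    supporting_x = [T for T in D if X <= set(T)]
    if not supporting_x:
        return 0.0  # convention chosen here for rules with an unsupported body
    return sum(1 for T in supporting_x if Y <= set(T)) / len(supporting_x)


# Hypothetical database of four itemsets.
D = [{"A", "F", "D"}, {"A", "F"}, {"A", "F", "D", "B"}, {"B", "D"}]

# Rule "A & F -> D": support is measured on the union of body and head.
s = support(D, {"A", "F"} | {"D"})
c = confidence(D, {"A", "F"}, {"D"})
print(s, c)
```

In this example two of the four itemsets contain A, F and D, and two of the three itemsets containing A and F also contain D.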

Sequential patterns 

Let L = {l_1, l_2, ..., l_m} be a set of literals called items. An itemset is a non-empty set of 
items. A sequence is an ordered list of itemsets and is denoted as <X_1 X_2 ... X_n>, where 
X_i is an itemset (X_i ⊆ L). X_i is called an element of the sequence. Let D be a set of 
variable length sequences, where for each sequence S = <X_1 X_2 ... X_n>, a timestamp is 
associated with each X_i. 

With no time constraints we say that a sequence X = <X_1 X_2 ... X_n> is contained in a 
sequence Y = <Y_1 Y_2 ... Y_m> if there exist integers i_1 < i_2 < ... < i_n such that 
X_1 ⊆ Y_{i_1}, X_2 ⊆ Y_{i_2}, ..., X_n ⊆ Y_{i_n}. We call <Y_{i_1}, Y_{i_2}, ..., Y_{i_n}> an 
occurrence of X in Y. We consider the 
following user-specified time constraints while looking for occurrences of a given 
sequence: minimal and maximal gap allowed between consecutive elements of an 
occurrence of the sequence (called min-gap and max-gap), maximal duration (called 
time window) of the occurrence and time tolerance, which represents the maximal 
time distance between two itemsets to treat them as a single one. 
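
Containment without time constraints can be checked with a simple greedy scan, as in the following Python sketch. Sequences are modeled as lists of sets, the function name is_contained is hypothetical, and the time constraints (min-gap, max-gap, time window, tolerance) are deliberately not handled.

```python
def is_contained(X, Y):
    """Return True if sequence X is contained in sequence Y.

    Both sequences are lists of itemsets (Python sets). Each element of X
    is matched to the earliest possible element of Y, which is sufficient
    when no time constraints apply.
    """
    pos = 0
    for x in X:
        # Advance to the next element of Y that is a superset of x.
        while pos < len(Y) and not x <= Y[pos]:
            pos += 1
        if pos == len(Y):
            return False
        pos += 1  # indices i_1 < i_2 < ... must be strictly increasing
    return True


found = is_contained([{3}, {4, 5}], [{7}, {3, 8}, {9}, {4, 5, 6}])
# Two elements of X cannot be matched to the same element of Y.
not_found = is_contained([{3}, {5}], [{3, 5}])
print(found, not_found)
```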

A sequential pattern is a sequence whose statistical significance in D is above a 
user-specified threshold. We consider two alternative measures of statistical significance 
for sequential patterns: support and number of occurrences. The support for the 
sequential pattern <X_1 X_2 ... X_n> in the set of sequences D is the percentage of sequences 
in D that contain the sequential pattern. While counting the support it is not important 
how many times a pattern occurs in a given data sequence. This makes support unsuitable 
when sequential patterns are mined over a single data sequence (|D| = 1). In such 
a case, the number of occurrences is more useful as a statistical measure. 

Generalized patterns 

In many applications, interesting patterns between items often occur at a relatively 
high concept level. For example, besides discovering that “40% of customers who 
purchase soda_03, also purchase potato_chips_12”, it could be informative to also 
show that “100% of customers who purchase any of beverages also purchase 
potato_chips_12”. Such association rules or sequential patterns utilize conceptual 
hierarchy information and are called, respectively, generalized association rules and 
generalized sequential patterns. 

A conceptual hierarchy, also called taxonomy, consists of a set of nodes organized 
in a tree, where nodes in the tree represent values of an attribute, called concepts. The 
generalized value of an attribute is the replacement of its value with its ancestor located 
on a given level in the conceptual hierarchy. Conceptual hierarchies are provided 
by domain experts or users, or generated automatically. 
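
Generalizing a value along a conceptual hierarchy can be sketched in Python as follows. The taxonomy is modeled as a child-to-parent dictionary, level 0 is assumed to be the root, and the concept names "food" and "snacks" as well as the function name generalize are hypothetical additions for illustration.

```python
def generalize(value, parent, level):
    """Replace value with its ancestor located on the given level.

    parent maps each concept to its parent concept; the root has no entry.
    Level 0 denotes the root. If the value itself lies above the requested
    level, it is returned unchanged.
    """
    path = [value]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    path.reverse()  # root first, original value last
    return path[level] if level < len(path) else value


# Hypothetical taxonomy: food -> {beverages, snacks} -> concrete products.
parent = {
    "soda_03": "beverages",
    "beverages": "food",
    "potato_chips_12": "snacks",
    "snacks": "food",
}
generalized = generalize("soda_03", parent, 1)
print(generalized)
```

Replacing every item by its generalized value before counting support is one simple way to obtain rules such as the beverages example above.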



1.2 Related Work 

The problem of mining association rules was first introduced in [1] and an algorithm 
called AIS was proposed. In [3], two new algorithms were presented, called Apriori 
and AprioriTid, that are fundamentally different from previous ones. In [6] a new 
algorithm, called Max-Miner, was introduced to efficiently mine long association patterns. 
In [5] an algorithm exploiting all user-specified statistical constraints (including 
minimum support and confidence) was presented. In [10] and [15] the problem of 
finding generalized (also called multiple-level) association rules based on a 
user-defined taxonomy was addressed. 

The problem of mining frequent patterns in a set of data sequences together with a 
few mining algorithms was first introduced in [4]. The class of patterns considered 
there, called sequential patterns, had a form of sequences of sets of items. The statisti- 
cal significance of a pattern (called support) was measured as a percentage of data 
sequences containing the pattern. In [16], the problem was generalized by adding 
taxonomy on items and time constraints such as min-gap, max-gap and sliding window 
(in this paper called tolerance). 

Another formulation of the problem of mining frequently occurring patterns in
sequential data was given in [13], where discovered patterns (called episodes) could
have different types of ordering: full (serial episodes), none (parallel episodes) or
partial, and had to appear within a user-defined time window. The episodes were mined
over a single event sequence and their statistical significance was measured as a
percentage of windows containing the episode (frequency) or as a number of occurrences.

In [11], the issue of integrating data mining with current database management 
systems was addressed and a concept of KDD queries was introduced. Several exten- 
sions of SQL were presented to handle association rules queries [7][9][12]. In [12], 
storing of discovered rules in a rulebase was discussed. 

In recent years many data mining projects have been developed. Some of them re- 
sulted in commercial products. The two particularly interesting data mining systems 
are DBMiner [9] and IBM Intelligent Miner (which evolved from the Quest project 
[2]). They are both multi-purpose data mining tools that can be used to discover vari- 
ous frequently occurring patterns in data but they also offer support for high-level 
tasks such as classification or clustering. Both tools have a graphical user interface 
that can be used to select the source dataset, the desired data mining method and
parameters required by the selected method. In addition, DBMiner offers a SQL-like data
mining query language (DMQL) that can be used to specify data mining tasks.
Although the systems currently available offer some level of integration with DBMSs,
they do not provide mechanisms for storage and management of data mining results.



2 RD2 Architecture

RD2 is a data mining system for relational databases, built initially on top of the
Oracle database management system. RD2 discovers association rules and sequential
patterns in database tables, according to users' needs expressed in the form of data
mining queries. The system uses RD2 Network Adapter to communicate with user
applications and RD2 Database Adapter to communicate with the database management
system. The RD2 system can work in a 2-tier architecture, residing on the database
computer, as well as in a 3-tier architecture, when residing on a separate computer.
Figure 1 gives an overview of the RD2 system architecture.



Fig. 1. RD2 Architecture

RD2 Network Adapter is a module for network communication between a client 
application and RD2 server. It uses TCP/IP protocol to transmit data mining queries 
generated by the client application to the server and to transmit the discovered asso- 
ciation rules or sequential patterns from the server back to the client application. RD2 
Network Adapter contains the programmer’s interface (RD2 API), which is used by 
client applications, cooperating with RD2. Advanced users can also use a tool, called 
RD2 FrontEnd, which is an RD2 application for ad-hoc data mining. Users can exe- 
cute their data mining queries and watch the results on the screen (see Figure 2). 

RD2 Database Adapter provides transparent access to various database management
systems. Although implemented primarily on the Oracle DBMS (using Oracle Call
Interface), the RD2 architecture is independent of the DBMS vendor. RD2 Database
Adapter translates its API calls into native DBMS functions. The adapter can
communicate with both local and remote DBMSs.

The client application requests are expressed in the form of MineSQL data mining
queries. RD2 MineSQL Parser is used for syntactic and semantic analysis of the
queries. It builds the parse tree for a query and then calls the appropriate query
processing procedures.

Association Rules Miner is a module for discovering all association rules which 
satisfy user-specified predicates. The association rules are discovered from the data- 
base accessed via Database Adapter while the predicates are extracted from MineSQL 
queries by MineSQL Parser. Our own constraints-driven algorithm for association 
rules discovery is used to perform the task.

Fig. 2. RD2 FrontEnd application

Sequential Patterns Miner is a module for discovering all sequential patterns
which satisfy user-specified predicates. The sequential patterns are discovered from 
the database accessed via Database Adapter while the predicates are extracted from 
MineSQL queries by MineSQL Parser. 

RD2 Snapshot Refresher is an independent module responsible for updating ma- 
terialized views defined on data mining queries. It re-executes the view queries in 
regular time intervals and stores the refreshed view contents in the database. 



3 MineSQL Data Mining Query Language 

MineSQL is a declarative language for expressing data mining problems. It is a
SQL-based interface between client applications and the RD2 system. MineSQL plays a
similar role for data mining applications as SQL does for database applications.
MineSQL is declarative - the client application is separated from the data mining
algorithm being used. Any modifications and improvements made to the algorithm do not
influence the applications. MineSQL follows the syntax philosophy of the SQL language -
data mining queries can be combined with SQL queries, i.e. SQL results can be mined and
MineSQL results can be queried. Thus, existing database applications can be easily
modified to use RD2 data mining. MineSQL is also multipurpose - it can be used for
different types of frequent patterns: association rules, sequential patterns, generalized
association rules, and generalized sequential patterns. MineSQL also supports storage
of the discovered patterns in the database - both association rules and sequential
patterns can be stored in database tables similarly to alphanumeric data. Repetitive data
mining tasks can be optimized by means of data mining views and materialized views.



3.1 New SQL Data Types 

MineSQL language defines a set of new SQL data types, which are used to store and 
manage association rules, sequential patterns, itemsets and itemset sequences. The 
new data types can be used for table column definitions or as program variable types. 

The SET OF data types family is used to represent sets of items, e.g. a shopping
cart contents. SET OF NUMBER data type represents a set of numeric items, SET OF
CHAR represents a set of character items, etc. Below we give an example of a
statement, which creates a new table to store basket contents.

CREATE TABLE SHOPPING (ID NUMBER, BASKET SET OF CHAR);

In order to convert single item values into a SET OF value, we use a new SQL group 
function called SET. Below we give an example of a SQL query, which returns the 
SET OF values from normalized market baskets. 

SELECT SET(ITEM)
FROM PURCHASED_ITEMS GROUP BY T_ID;

PURCHASED_ITEMS:
T_ID  ITEM
1     A
2     A
2     C
2     D

SET(ITEM)
---------
A
A, C, D

We define the following SQL functions and operators for the SET OF data types:
ITEM(x,i) - the i-th item value of the set x (lexicographical ordering presumed),
SIZE(x) - the number of items in the set x, s CONTAINS q - TRUE if the set s contains
the set q, s UNION q - the union of the sets s and q, s MINUS q - the difference of the
sets s and q, s INTERSECT q - the intersection of the sets s and q.
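As a rough analogy only, the SET group function and the SET OF operators behave like grouping rows into Python sets. The sketch below mimics the PURCHASED_ITEMS example; the helper `set_group_by` and the 1-based reading of ITEM(x,i) are assumptions made for illustration, not part of MineSQL.

```python
from collections import defaultdict

# Rows of the PURCHASED_ITEMS example table: (T_ID, ITEM).
purchased_items = [(1, "A"), (2, "A"), (2, "C"), (2, "D")]


def set_group_by(rows):
    """Mimic SELECT SET(ITEM) ... GROUP BY T_ID: one frozenset of items per T_ID."""
    groups = defaultdict(set)
    for t_id, item in rows:
        groups[t_id].add(item)
    return {t_id: frozenset(items) for t_id, items in groups.items()}


baskets = set_group_by(purchased_items)
s, q = baskets[2], baskets[1]

first_item = sorted(s)[0]    # ITEM(s, 1), assuming 1-based lexicographic order
size = len(s)                # SIZE(s)
s_contains_q = q <= s        # s CONTAINS q
union = s | q                # s UNION q
minus = s - q                # s MINUS q
intersect = s & q            # s INTERSECT q
```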

The RULE OF data types family is used to represent association rules, containing
body, head, support and confidence values. RULE OF NUMBER data type represents
association rules on numeric values (e.g. "1 & 3 -> 5"), RULE OF CHAR data type
represents association rules on character values (e.g. "beer & diapers -> chips"), etc.
The following statement creates a new table to store association rules.

CREATE TABLE MYRULES (ID NUMBER, AR RULE OF CHAR);

A user can insert an association rule into the table manually, using a new SQL function
called TO_RULE, e.g.:

INSERT INTO MYRULES (ID, AR) VALUES (15, TO_RULE('A, D', 'B, C', 0.8, 0.1));
SELECT * FROM MYRULES;

ID  AR
--  --------------------------
15  A & D -> B & C (0.8, 0.1)

We also define a set of the following SQL functions and operators that operate on
rules: BODY(x) - the SET OF value representing the body of the rule x, HEAD(x) - the
SET OF value representing the head of the rule x, SUPPORT(x) - support of the rule x,
CONFIDENCE(x) - confidence of the rule x, s [NOT] SATISFIES x - TRUE if the set
s satisfies the rule x, s [NOT] VIOLATES x - TRUE if the set s violates the rule x.

The following example query displays all sets from the PURCHASED_ITEMS table,
which violate the association rule "A & D -> B & C & G".

SELECT SET(ITEM)
FROM PURCHASED_ITEMS GROUP BY T_ID
HAVING SET(ITEM) VIOLATES TO_RULE('A, D', 'B, C, G', 0.8, 0.1);
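The text does not spell out when a set satisfies or violates a rule. The sketch below assumes a common reading: a set satisfies a rule if it contains both the body and the head, and violates it if it contains the body but not the whole head. The `Rule` type and both helper functions are hypothetical names used only for this illustration.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    body: FrozenSet[str]
    head: FrozenSet[str]


def satisfies(s: FrozenSet[str], rule: Rule) -> bool:
    # Assumed semantics: the set contains both the body and the head.
    return rule.body <= s and rule.head <= s


def violates(s: FrozenSet[str], rule: Rule) -> bool:
    # Assumed semantics: the set contains the body but not the whole head.
    return rule.body <= s and not rule.head <= s


# The rule "A & D -> B & C & G" and the two baskets of PURCHASED_ITEMS.
rule = Rule(body=frozenset("AD"), head=frozenset("BCG"))
baskets = [frozenset("A"), frozenset("ACD")]
violating = [b for b in baskets if violates(b, rule)]
```

Under these assumptions only the basket {A, C, D} violates the rule; the basket {A} neither satisfies nor violates it, because it does not contain the rule body.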



The SEQUENCE OF data types family is used to represent sequences of sets of items, 
e.g. histories of customers’ purchases. Sequences are ordered collections of (time- 
stamp, value) pairs, where timestamp is usually of date and time type and value can be 
a set of elements of any type. For example, SEQUENCE OF CHAR INDEX BY DATE 
data type represents a sequence of sets of character items ordered according to time- 
stamps of date type. Below we give an example of a statement, which creates a new 
table to store purchase histories. 



CREATE TABLE SHOPPING_HIST
(CUST_ID NUMBER, HIST SEQUENCE OF CHAR INDEX BY DATE);



In order to convert a collection of (timestamp, value) pairs into a SEQUENCE OF 
value, we use a new SQL group function called SEQUENCE. The example below 
shows a query returning the SEQUENCE OF values from normalized market baskets. 



SELECT SEQUENCE(T_TIME, ITEM)
FROM CUST_TRANSACTIONS GROUP BY C_ID;

CUST_TRANSACTIONS:
T_TIME        C_ID  ITEM
Feb 20, 2000  10    A
Feb 20, 2000  20    B
Feb 20, 2000  20    D
Feb 21, 2000  10    D
Feb 22, 2000  20    A

SEQUENCE(T_TIME, ITEM)
-------------------------------------------
<Feb 20, 2000; (A)> <Feb 21, 2000; (D)>
<Feb 20, 2000; (B, D)> <Feb 22, 2000; (A)>
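The behaviour of the SEQUENCE group function on this example can be imitated with a short Python sketch. The function name `sequence_group_by` and the representation of a sequence as a list of (timestamp, itemset) pairs are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# Rows of the CUST_TRANSACTIONS example table: (T_TIME, C_ID, ITEM).
cust_transactions = [
    (date(2000, 2, 20), 10, "A"),
    (date(2000, 2, 20), 20, "B"),
    (date(2000, 2, 20), 20, "D"),
    (date(2000, 2, 21), 10, "D"),
    (date(2000, 2, 22), 20, "A"),
]


def sequence_group_by(rows):
    """Mimic SELECT SEQUENCE(T_TIME, ITEM) ... GROUP BY C_ID."""
    per_customer = defaultdict(lambda: defaultdict(set))
    for t_time, c_id, item in rows:
        # Items bought by one customer at the same time form one itemset.
        per_customer[c_id][t_time].add(item)
    # Each sequence is a list of (timestamp, itemset) pairs ordered by timestamp.
    return {
        c_id: [(t, frozenset(items)) for t, items in sorted(by_time.items())]
        for c_id, by_time in per_customer.items()
    }


sequences = sequence_group_by(cust_transactions)
```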

A user can insert a sequence into the table manually, using a new SQL function called
TO_SEQUENCE. The following example presents the appropriate INSERT statement
and the resulting contents of SHOPPING_HIST.

INSERT INTO SHOPPING_HIST (CUST_ID, HIST)
VALUES (51, TO_SEQUENCE('<Feb 20, 2000; (A)> <Feb 21, 2000; (D, F)>'));

We also define a set of the following SQL functions and operators that operate on
sequences: ELEMENT(x,i) - the i-th element value of the sequence x, LENGTH(x) -
the number of elements in the sequence x, SIZE(x) - the number of items in the
sequence x, s CONTAINS t - TRUE if the sequence s contains the sequence t,
s CONTAINS t MAXGAP x - TRUE if the sequence s contains the sequence or pattern t and
the interval between adjacent matching elements in s is not greater than x,
s CONTAINS t MINGAP x - TRUE if the sequence s contains the sequence or pattern t and
the interval between adjacent matching elements in s is not less than x, s CONTAINS t
WINDOW x - TRUE if the sequence s contains the sequence or pattern t within the
time window of x, s CONTAINS t TOLERANCE x - TRUE if the sequence s contains
the sequence or pattern t with the tolerance of x.
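The MAXGAP and MINGAP variants of CONTAINS can be sketched as a small backtracking search. This is only an illustration under stated assumptions: timestamps are plain numbers, the sequence is already ordered by timestamp, and the WINDOW and TOLERANCE variants are left out. The function name and its keyword parameters are hypothetical.

```python
from typing import FrozenSet, Optional, Sequence, Tuple

Element = Tuple[float, FrozenSet[str]]  # (timestamp, itemset)


def contains(s: Sequence[Element], t: Sequence[FrozenSet[str]],
             min_gap: Optional[float] = None,
             max_gap: Optional[float] = None) -> bool:
    """True if every element of t is matched, in order, by a superset in s and
    the interval between adjacent matching elements respects the gap bounds."""

    def match(k: int, start: int, prev_time: Optional[float]) -> bool:
        if k == len(t):
            return True
        for idx in range(start, len(s)):
            time, items = s[idx]
            if not t[k] <= items:
                continue
            if prev_time is not None:
                gap = time - prev_time
                if min_gap is not None and gap < min_gap:
                    continue
                if max_gap is not None and gap > max_gap:
                    continue
            # Try this match and backtrack if the rest of t cannot be placed.
            if match(k + 1, idx + 1, time):
                return True
        return False

    return match(0, 0, None)
```

Backtracking is needed because the first possible match of an element is not always the one that lets the following elements satisfy the gap bounds.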

The PATTERN OF data types family is used to represent sequential patterns and 
their statistical significance (support or number of occurrences). PATTERN OF CHAR 
data type represents patterns on character values (e.g. "<(TV) (VCR) (DVD)>"), 
PATTERN OF NUMBER data type represents patterns on numeric values (e.g. "<(10 
20 30) (40 50)>"), etc. Below we give an example of a statement, which creates a new 
table to store sequential patterns. 

CREATE TABLE MYPATTERNS (ID NUMBER, SP PATTERN OF NUMBER);

A user can insert a sequential pattern into the table manually, using the new SQL
function called TO_PATTERN. The following example presents the appropriate
INSERT statement and the resulting contents of MYPATTERNS.

INSERT INTO MYPATTERNS (ID, SP) 

VALUES (15, TO_PATTERN( '<(10 20) (30)>',0.3, null)); 

We also define a set of the following SQL functions and operators that operate on
sequential patterns: ELEMENT(x,i) - the i-th element of the pattern x, LENGTH(x) -
the number of elements in the pattern x, SIZE(x) - the number of items in the pattern x,
SUPPORT(x) - support of the pattern x, OCCURRENCES(x) - number of occurrences
of the pattern x, p CONTAINS s - TRUE if the pattern p contains the sequence s.

The following query displays all patterns from MYPATTERNS, which contain the
subsequence <(10)(30)> (notice the dynamic sequence creation in the example).

SELECT SP FROM MYPATTERNS
WHERE SP CONTAINS TO_SEQUENCE('<1; (10)> <2; (30)>');



3.2 Mining Association Rules 

The central statement of the MineSQL language is MINE. MINE is used to discover
association rules or sequential patterns from the database. MINE also specifies a set of
predicates to be satisfied by the returned rules or patterns. In order to discover
association rules we use the following syntax of the MINE statement.

MINE rule_expression [, rule_expression...] FOR column [, column...]
FROM {table | (query)} WHERE rule_predicate [AND rule_predicate...];

where: rule_expression is the keyword RULE or a function operating on RULE
(RULE represents a single association rule being discovered), column is the name of
the table column or query column of the type SET OF, containing itemsets to be
mined; when specifying multiple columns, then the itemsets are combined (columns
must be of the same data type), table is the name of a table containing the itemsets to
be mined, query is the SQL subquery, returning the itemsets to be mined,
rule_predicate is a Boolean predicate on a function which operates on RULE, to be
satisfied by returned association rules.

The following MINE statement uses the PURCHASED_ITEMS table to discover all
association rules, whose support is greater than 0.1 and confidence is greater than 0.3.
We display the whole association rules, their bodies and supports.

MINE RULE, BODY(RULE), SUPPORT(RULE)
FOR X FROM (SELECT SET(ITEM) AS X
            FROM PURCHASED_ITEMS GROUP BY T_ID)
WHERE SUPPORT(RULE) > 0.1 AND CONFIDENCE(RULE) > 0.3;
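To make the meaning of the two predicates concrete, the following Python sketch enumerates rules by brute force. It is not the constraints-driven algorithm used by RD2, which is not described here; it only applies the usual definitions, where the support of a rule is the fraction of itemsets containing body and head together and the confidence is that support divided by the support of the body. The function name `mine_rules` is illustrative.

```python
from itertools import combinations


def mine_rules(itemsets, min_support, min_confidence):
    """Brute-force enumeration of rules body -> head whose support and confidence
    are strictly above the thresholds (illustration only, exponential cost)."""
    n = len(itemsets)
    items = sorted(set().union(*itemsets))

    def supp(candidate):
        return sum(candidate <= s for s in itemsets) / n

    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            whole = frozenset(combo)
            whole_supp = supp(whole)
            if whole_supp <= min_support:
                continue
            # Split the frequent itemset into every non-empty body and head.
            for body_size in range(1, size):
                for body_items in combinations(combo, body_size):
                    body = frozenset(body_items)
                    confidence = whole_supp / supp(body)
                    if confidence > min_confidence:
                        rules.append((body, whole - body, whole_supp, confidence))
    return rules


# The two baskets obtained from PURCHASED_ITEMS.
baskets = [frozenset("A"), frozenset("ACD")]
rules = mine_rules(baskets, min_support=0.1, min_confidence=0.3)
```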



3.3 Mining Sequential Patterns 

In order to discover sequential patterns we use the following syntax of MINE state- 
ment. 

MINE patt_expression [, patt_expression...] 

[WINDOW window] [MAXGAP maxgap] [MINGAP mingap] [TOLERANCE tolerance] 

FOR column FROM {table | (query)} 

WHERE patt_predicate [AND patt_predicate...] ; 

where: patt_expression is the keyword PATTERN or a function operating on
PATTERN (PATTERN represents a single sequential pattern being discovered), window is
the time window size, maxgap is the maximal gap allowed between consecutive
elements of an occurrence of the sequence, mingap is the minimal gap allowed between
consecutive elements of an occurrence of the sequence, tolerance is the time tolerance
value for pattern elements, column is the name of the table column or query column of
the type SEQUENCE OF, containing sequences to be mined, table is the name of a
table containing the sequences to be mined, query is the SQL subquery, returning the
sequences to be mined, patt_predicate is a Boolean predicate on a function which
operates on PATTERN, to be satisfied by returned sequential patterns.

The following MINE statement uses the CUST_TRANSACTIONS table to discover
all sequential patterns, whose support is greater than 0.1. We display the patterns and
their supports.

MINE PATTERN, SUPPORT(PATTERN)
FOR X FROM (SELECT SEQUENCE(T_TIME, ITEM) AS X
            FROM CUST_TRANSACTIONS GROUP BY C_ID)
WHERE SUPPORT(PATTERN) > 0.1;



3.4 Using Views and Materialized Views

Relational databases provide users with a possibility of creating views and
materialized views. A view is a virtual table presenting the results of the SQL query
hidden in the definition of the view. Views are used mainly to simplify access to
frequently used data sets that are results of complex queries. When a user selects data
from a view, its defining query has to be executed but the user does not have to be
familiar with its syntax.

Since data mining tasks are repetitive in nature and the syntax of data mining
queries may be complicated, we propose to extend the usage of views to handle both SQL
queries and MineSQL queries. The following statement creates the view presenting the
results of one of the data mining tasks discussed earlier.

CREATE VIEW BASKET_RULES
AS MINE RULE, BODY(RULE), SUPPORT(RULE)
FOR X FROM (SELECT SET(ITEM) AS X
            FROM PURCHASED_ITEMS GROUP BY T_ID)
WHERE SUPPORT(RULE) > 0.1;

Any SQL query concerning the view presented above involves performing the data 
mining task according to the data mining query that defines the view. This guarantees 
access to up-to-date patterns but leads to long response times, since data mining algo- 
rithms are time consuming. In database systems it is possible to create materialized 
views that materialize the results of the defining query to shorten response times. Of 
course, data presented by a materialized view may become invalid as the source data 
changes. One of the solutions minimizing effects of this problem is periodic refreshing 
of materialized views. 

We introduce materialized data mining views with the option of automatic periodic 
refreshing. A materialized data mining view is a database object containing patterns 
(association rules or sequential patterns) discovered as a result of a data mining query. 
It contains rules and patterns that were valid at a certain point of time. Materialized 
data mining views can be used for further selective analysis of discovered patterns 
with no need to re-run mining algorithms. Using materialized views is easier than 
creating a table with columns of type RULE OF or PATTERN OF and filling it with 
results of a data mining query. Moreover, materialized views offer additional func- 
tionality, because they can be automatically refreshed according to a user-defined time 
interval. This might be useful when a user is interested in a set of rules or sequential 
patterns, whose specification does not change in time, but always wants to have access 
to relatively recent information. 

To create a materialized data mining view we use the following syntax. 

CREATE MATERIALIZED VIEW view_name 

[REFRESH time_interval ] AS mine_statement 

In the above syntax view_name is the name of a materialized view, time_interval
denotes the time interval between two consecutive refreshes of the view (in days), and
mine_statement denotes any variation of the MINE statement. The REFRESH clause is
optional since a user might not want a view to be refreshed automatically.
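The effect of the optional REFRESH clause can be mimicked in a few lines of Python. This is a simplified sketch: the class `MaterializedMiningView` is hypothetical, and it refreshes lazily when the view is read, whereas RD2 Snapshot Refresher is described as an independent module that re-executes the view queries at regular intervals.

```python
import time
from typing import Callable, List, Optional


class MaterializedMiningView:
    """Caches the result of a mining query; re-runs it once the interval elapsed."""

    def __init__(self, query: Callable[[], List],
                 refresh_days: Optional[float] = None,
                 clock: Callable[[], float] = time.time):
        self._query = query
        # None means the view is never refreshed automatically.
        self._refresh_seconds = None if refresh_days is None else refresh_days * 86400
        self._clock = clock
        self._contents = query()          # materialize at creation time
        self._refreshed_at = clock()

    def select(self) -> List:
        stale = (self._refresh_seconds is not None
                 and self._clock() - self._refreshed_at >= self._refresh_seconds)
        if stale:
            self._contents = self._query()
            self._refreshed_at = self._clock()
        return self._contents
```

Reading the view between refreshes returns the stored patterns without re-running the mining query, which is the response-time benefit described above; the price is that the patterns may be up to one interval out of date.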



3.5 Taxonomies and Generalized Patterns

In order to discover generalized association rules and generalized sequential patterns,
MineSQL allows users to define conceptual hierarchies. A conceptual hierarchy, or a
taxonomy, is a persistent database object created by means of CREATE TAXONOMY
and INSERT statements. The following example illustrates a conceptual hierarchy
(called MY_TAX) and the statements for its definition.

CREATE TAXONOMY MY_TAX OF CHAR;
INSERT INTO MY_TAX NODE 'R';
INSERT INTO MY_TAX NODE 'G1' REFERENCES 'R';
INSERT INTO MY_TAX NODE 'G2' REFERENCES 'R';
INSERT INTO MY_TAX NODE 'A' REFERENCES 'G1';
INSERT INTO MY_TAX NODE 'B' REFERENCES 'G1';
INSERT INTO MY_TAX NODE 'C' REFERENCES 'G2';
INSERT INTO MY_TAX NODE 'D' REFERENCES 'G2';

After a conceptual hierarchy has been created, MINE statements can be extended to
use it. The additional keyword USING is used to specify the name of a conceptual
hierarchy to be used for the associated attribute. The following example data mining
query discovers all generalized association rules between values of the attribute ITEM
in the database table PURCHASED_ITEMS, grouped by the T_ID attribute. Values of the
attribute ITEMS are generalized by means of the conceptual hierarchy called
MY_TAX.

MINE RULE FOR ITEMS USING MY_TAX
FROM (SELECT SET(ITEM) AS ITEMS
      FROM PURCHASED_ITEMS GROUP BY T_ID)
WHERE SUPPORT(RULE) > 0.3;

Conceptual hierarchies also influence previously presented set, rule, sequence, and
sequential pattern operators: s CONTAINS q USING c - TRUE if the set s contains the
set q according to the conceptual hierarchy c, s [NOT] SATISFIES x USING c - TRUE
if the set s satisfies the rule x according to the conceptual hierarchy c, s [NOT]
VIOLATES x USING c - TRUE if the set s violates the rule x according to the conceptual
hierarchy c, s CONTAINS t USING c - TRUE if the sequence s contains the sequence t
according to the conceptual hierarchy c, p CONTAINS s USING c - TRUE if the
pattern p contains the sequence s according to the conceptual hierarchy c.



         R
       /   \
     G1     G2
    /  \   /  \
   A    B C    D
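The MY_TAX hierarchy and the USING variant of CONTAINS can be imitated as follows. The exact semantics of "contains according to the conceptual hierarchy" is not given in the text; the sketch assumes that every item of q must either be an item of s or an ancestor of some item of s. The dictionary layout and both function names are illustrative.

```python
# Child -> parent links, mirroring the INSERT INTO MY_TAX statements.
MY_TAX = {"G1": "R", "G2": "R", "A": "G1", "B": "G1", "C": "G2", "D": "G2"}


def ancestors(item, taxonomy):
    """Return the item itself plus all of its ancestors up to the root."""
    result = {item}
    while item in taxonomy:
        item = taxonomy[item]
        result.add(item)
    return result


def contains_using(s, q, taxonomy):
    """Assumed semantics of `s CONTAINS q USING c`: every item of q is an item
    of s or an ancestor of some item of s."""
    generalized = set()
    for item in s:
        generalized |= ancestors(item, taxonomy)
    return set(q) <= generalized


example = contains_using({"A", "C"}, {"G1", "C"}, MY_TAX)
```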



4 Concluding Remarks 

In the paper we have presented our research prototype system, which logically extends 
DBMS functionality to mine relational databases. We introduced a data mining query 
language, called MineSQL, which supports pattern discovery, storage and manage- 
ment in relational environment. A number of illustrative examples for various 
MineSQL statements have been presented. 

In the future we plan to extend the results of our research along the following lines:
1. DBMS support for data mining in object-oriented databases, focused on using
inheritance hierarchies as taxonomies and mining of polymorphic collections,
2. Research on data mining query optimization methods, focused on materialized views
group refreshing and physical data structures, 3. Study whether other data mining
methods should be supported by MineSQL, e.g. data classification, clustering.

T. Morzy, M. Wojciechowski, and M. Zakrzewicz



Decision trees for probabilistic data

Jean-Pascal ABOA*, Richard EMILIONl 



Abstract 

We propose an algorithm to build decision trees when the observed 
data are probability distributions. This is of interest when one deals with 
massive database or with probabilistic models. We illustrate our method 
with a dataset describing districts of Great Britain. Our decision tree 
yields rules which explain the unemployment rate. 

The decision tree in our case is built by replacing the test $X > a$, which
is used to split the nodes in the usual case of real numbers, by the test
$P(X > a) < \beta$, where $a$ and $\beta$ are determined through an algorithm based
on probabilistic split evaluation criteria.



1 Introduction 

Decision trees are particularly suited to data mining (see e.g. [1], [6]). Generally,
they are constructed from a set of objects whose attributes are numerical or
categorical deterministic variables. In the present paper we present a decision
tree construction method for objects described by probabilistic data. Such
descriptions naturally appear because of the phenomenal growth of dataset sizes.
This argument is illustrated in the example in section 2.

More generally, probability distribution functions (p.d.f.) appear when the
initial data are inaccurate or if one deals with probabilistic models. We thus see
that it is a very natural (and also theoretically interesting) problem to extend
standard statistical or data analysis methods, which usually work for real data, to
the case of more complex data such as p.d.f. This is emphasized for example
in [5], [8].

The rest of this paper is organized as follows. Section 2 presents an example 
of G.B. districts described by probabilistic data. In section 3, we introduce our 
basic setting. Section 4 is devoted to the decision tree construction algorithm. 
This yields probabilistic rules which explain the unemployment rate of each 
district. A conclusion is given in section 5. 

2 An example: G.B. districts

We start with a table of $N$ rows and $p$ columns. All the entries $f_{ij}$ in a fixed
column $j$ are probabilities on a common space $V_j$ which only depends on $j$. Let
$N = 135$ districts of Great Britain (G.B.) and $p = 6$ descriptors:

$X_1$ denotes the age descriptor (in years):
$X_1 = 1$ for 0-4, $X_1 = 2$ for 5-14, $X_1 = 3$ for 15-24, $X_1 = 4$ for 25-44,
$X_1 = 5$ for 45-64, $X_1 = 6$ for greater than or equal to 65.

$X_2$ denotes the racial origin:
$X_2 = 1$ for Whites, $X_2 = 2$ for Blacks, $X_2 = 3$ for Asians, $X_2 = 4$ otherwise.

$X_3$ denotes the number of children:
$X_3 = 1$ for less than 4 children, $X_3 = 2$ for more than 4 children.

$X_4$ denotes the accommodation type:
$X_4 = 1$ for owner occupied accommodation, $X_4 = 2$ for public sector
accommodation, $X_4 = 3$ for private sector accommodation, $X_4 = 4$ in other cases.

$X_5$ denotes the social class of the household head:
$X_5 = 1$ if head is in class 1 or 2, $X_5 = 2$ if head is in class 3, $X_5 = 3$ if head is
in class 4 or 5, $X_5 = 4$ for the others.

$X_6$ denotes the occupation domain:
$X_6 = 1$ for Agriculture, $X_6 = 2$ for Primary Production, $X_6 = 3$ for
manufacturing, $X_6 = 4$ for Service, $X_6 = 5$ in other cases.



Districts   Age                       Origin                  Number of kids
Hambleton   6% 12% 14% 28% 24% 16%    99.6% 0.1% 0.1% 0.2%    1.0% 99.0%

Districts   Accommodation             Social class            Occupation
Hambleton   73.0% 12.0% 7.0% 8.0%     46% 10% 15% 29%         10% 3% 11% 42% 34%





[Figure: graphical display of the Hambleton distributions for Age, Origin, Number of
kids, Accommodation, Social class and Occupation]



The classifying attribute Y is the unemployment rate for men and women. 
In our application this rate fluctuates between 0% and 18%. We define two 
prior classes : class 1 for rate < 9% and class 0 for rate > 9%. 

3 Basic setting

In a general situation we will consider a table $(f_{ij})_{i=1,\dots,N,\ j=1,\dots,p}$ of
probabilities on sets $V_j$ as described in section 2. Let $o_i$ denote the object number
$i$ described by the $i$-th row $f_i = (f_{i1}, \dots, f_{ip})$. The whole set
$\{o_1, \dots, o_N\}$ will be denoted by $O$. In the example of section 2, $o_i$ is
nothing but the $i$-th district.

It will be convenient to suppose that in each column $j$, $f_{ij}$ is the probability
distribution of a random variable $X_{ij} : (\Omega, P) \to V_j$.

The row random vectors $X_i = (X_{i1}, \dots, X_{ip})$ are independent (but not the
column vectors). For each $j$ let $\mathcal{V}_j$ be a finite class of subsets of $V_j$
which will be used to split the decision tree nodes.

In the example of section 2, we can take $\mathcal{V}_j$ as the power set of $V_j$. But
if $V_j$ is large we select relevant subsets by generalizing the result in ([3] p. 274).

The classifying variable $Y$ is a categorical attribute. We let $Y(o_i) = y_i$.



4 The method and the algorithm 

The root (starting node) of the decision tree is the whole set $O$. Given a node,
we split it into two children in the best way possible. The splitting is determined
by an attribute $X_j$ and the best $j$ is chosen according to a splitting quality
criterion. We repeat the procedure on each child.

Decision trees thus successively divide the training set until each subset
belongs, entirely or predominantly, to a single class.

4.1 Dichotomy 

Let $X_j$ be an attribute, and let $I \in \mathcal{V}_j$ and $\beta \in (0, 1)$. The
dichotomy induced by $(X_j, I, \beta)$ is defined by:

$$D_{j,I,\beta} = \{N_1, N_2\} \qquad (1)$$

$$N_1 = \{o_i : P(X_{ij} \in I) \ge \beta\} \quad \text{and} \quad N_2 = \{o_i : P(X_{ij} \in I) < \beta\}$$



4.2 Goodness of a split 

Our method uses the entropy index [7] to derive the splitting criterion at every
internal node of the tree. To each node $N$ is associated a homogeneity index $i_N$
defined by:

$$i_N = -P_{0/N} \log_2 P_{0/N} - P_{1/N} \log_2 P_{1/N} \qquad (2)$$

where $P_{0/N}$ (resp. $P_{1/N}$) is the proportion of objects in $N$ belonging to class
0 (resp. class 1).

Each division $D = \{N_1, N_2\}$ of the node $N$ induces a homogeneity
increase defined by the following number:

$$\Delta(D) = i_N - \{P_{N_1} i_{N_1} + P_{N_2} i_{N_2}\} \qquad (3)$$

where $i_N$, $i_{N_1}$, $i_{N_2}$ are the homogeneity indices of nodes $N$, $N_1$, $N_2$
respectively, and $P_{N_1}$, $P_{N_2}$ are the proportions of objects assigned to $N_1$,
$N_2$ respectively after partitioning.
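Equations (2) and (3) translate directly into a few lines of Python. The sketch below assumes that class labels are coded 0 and 1 and that $0 \log_2 0$ is taken as 0; the function names are illustrative.

```python
from math import log2
from typing import Sequence


def entropy_index(labels: Sequence[int]) -> float:
    """Two-class entropy index i_N of a node, with 0 * log2(0) taken as 0."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)   # proportion of class 1
    return -sum(p * log2(p) for p in (1.0 - p1, p1) if p > 0)


def homogeneity_increase(parent: Sequence[int], left: Sequence[int],
                         right: Sequence[int]) -> float:
    """Delta(D) = i_N - (P_N1 * i_N1 + P_N2 * i_N2), following equation (3)."""
    n = len(parent)
    return entropy_index(parent) - (
        len(left) / n * entropy_index(left)
        + len(right) / n * entropy_index(right)
    )
```

A split that separates the two classes perfectly gains the whole entropy of the parent node, while a split that leaves the class proportions unchanged in both children gains nothing.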

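For each node $N$ and each $j$, we select $I_j \in \mathcal{V}_j$ and $\beta_j \in (0,1)$
for which $\Delta(D_{j,I_j,\beta_j})$ is maximal. We finally select the "best" $X_{j^*}$
among the $p$ attributes by:

$$\Delta(D_{j^*,I_{j^*},\beta_{j^*}}) = \max_{j=1,\dots,p} \Delta(D_{j,I_j,\beta_j}) \qquad (4)$$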



4.3 Algorithm 



Let $\varepsilon > 0$ be a fixed integer
Starting node $N = \{o_1, \dots, o_N\}$

Partition(node $N$)
    If ($(100 - \varepsilon)$% of objects in $N$ are of the same class) then
        return;
    For each $j$
    begin
        For each $r = 1, \dots, n_j$
        begin
            determine the best $\beta_{jr}$ according to the test
                $P(X_{ij} \in I_{jr}) \ge \beta_{jr}$
        End
        determine the "best" $(I_j, \beta_j)$ among $(I_{jr}, \beta_{jr})$, $r = 1, \dots, n_j$
    End
    determine the "best" $(I_{j^*}, \beta_{j^*})$ among $(I_j, \beta_j)$, $j = 1, \dots, p$,
    which splits node $N$ into two nodes $N_1$ and $N_2$.
    Partition(node $N_1$)
    Partition(node $N_2$)
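The procedure above can be sketched in Python under a few assumptions that the text does not fix: each object is a list of $p$ distributions stored as {value: probability} dictionaries, class labels are 0 or 1, the candidate thresholds $\beta$ are restricted to the probabilities observed in the node, and recursion also stops when no split gives a positive homogeneity increase. All names are illustrative.

```python
from math import log2


def entropy(labels):
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return -sum(p * log2(p) for p in (1 - p1, p1) if p > 0)


def prob_in(dist, subset):
    """P(X in subset) for a distribution given as a {value: probability} dict."""
    return sum(dist.get(v, 0.0) for v in subset)


def best_split(objects, labels, subsets):
    """Search attributes j, subsets I in subsets[j] and thresholds beta for the
    largest homogeneity increase; beta ranges over the observed probabilities."""
    best = None
    for j, candidates in enumerate(subsets):
        for subset in candidates:
            probs = [prob_in(obj[j], subset) for obj in objects]
            for beta in sorted(set(probs)):
                left = [y for p, y in zip(probs, labels) if p >= beta]
                right = [y for p, y in zip(probs, labels) if p < beta]
                if not right:
                    continue  # beta at the minimum puts every object in N1
                gain = entropy(labels) - (len(left) * entropy(left)
                                          + len(right) * entropy(right)) / len(labels)
                if best is None or gain > best[0]:
                    best = (gain, j, subset, beta)
    return best


def partition(objects, labels, subsets, eps=0):
    """Recursive Partition(node): returns a nested dict describing the tree."""
    majority = max(set(labels), key=labels.count)
    purity = 100.0 * labels.count(majority) / len(labels)
    split = None if purity >= 100 - eps else best_split(objects, labels, subsets)
    if split is None or split[0] <= 0:
        return {"leaf": majority, "size": len(labels)}
    _, j, subset, beta = split
    go_left = [prob_in(obj[j], subset) >= beta for obj in objects]

    def pick(keep):
        return ([o for o, g in zip(objects, go_left) if g == keep],
                [y for y, g in zip(labels, go_left) if g == keep])

    return {"attribute": j, "subset": subset, "beta": beta,
            "ge": partition(*pick(True), subsets, eps),
            "lt": partition(*pick(False), subsets, eps)}


# Toy data: four objects, one attribute with values 1 and 2.
objs = [[{1: 0.9, 2: 0.1}], [{1: 0.8, 2: 0.2}], [{1: 0.3, 2: 0.7}], [{1: 0.2, 2: 0.8}]]
tree = partition(objs, [1, 1, 0, 0], subsets=[[(1,)]])
```

On the toy data the root test is $P(X_1 \in \{1\}) \ge 0.8$, which separates the two classes completely, so both children are leaves.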



4.4 Application 

The above algorithm yields the following tree and rules. 










Figure 1: Decision tree for the G.B. districts 



Rule 1 : if % (residents in households where head is class 3, 4 or 5) > 30% 
and % (residents in households where head is class 1 or 2) > 32% 
and % (residents in households where head is class 1 or 2 or other cases) > 66% 
then the unemployment rate is high with 75% chance and weak with 25% chance 



Rule 2 : if % (residents in households where head is class 3, 4 or 5) > 30% 
and % (residents in households where head is class 1 or 2) > 32% 
and % (residents in households where head is class 1 or 2 or other cases) < 66% 
then the unemployment rate is weak with 100% chance 



Rule 3 : if % (residents in households where head is class 3, 4 or 5) > 30% 

and % (residents in households where head is class 1 or 2) < 32% 

then the unemployment rate is high with 97% chance and weak with 3% chance 



Rule 4 : if % (residents in households where head is class 3, 4 or 5) < 30% 
and % (residents in owner occupied accommodation) > 78%
then the unemployment rate is weak with 100% chance 



Rule 5 : if % (residents in households where head is class 3, 4 or 5) < 30% 
and % (residents in owner occupied accommodation) < 78%

then the unemployment rate is high with 27% chance and weak with 73% chance 



Table 1: Probabilistic rules 

Remark: Chavent [4] has proposed a similar algorithm with $\beta = 0.5$, but
$\beta = 0.5$ need not be the most pertinent threshold. Indeed the test
$P(X_5 \in \{2,3\}) > 0.3040$ gives an index of 0.2989, while for $\beta = 0.5$ the
"best" test $P(X \in \{3,4\}) > 0.5$ gives an index of only 0.2989.

5 Conclusion 

We have presented a decision tree construction method for probabilistic data.
This seems of interest when we deal with very large datasets. Our algorithm has
been implemented in C and applied to a dataset of G.B. districts.

Our main new technical point concerns the selection of subsets in the splitting
test. This may lead to further research.
Abstract. In data mining, searching for frequent patterns is a common
basic operation. It forms the basis of many interesting decision support
processes. In this paper we present a new type of pattern, binary
expressions. Based on the properties of a specified binary test, such as
reflexivity, transitivity and symmetry, we construct a generic algorithm
that mines all frequent binary expressions.

We present three applications of this new type of expressions: mining 
for rules, for horizontal decompositions, and in intensional database re- 
lations. 

Since the number of binary expressions can become exponentially large, 
we use data mining techniques to avoid exponential execution times. We 
present results of the algorithm that show an exponential gain in time 
due to a well chosen pruning technique. 



1 Introduction 

In data mining, searching for frequent patterns is a basic operation. It forms
the basis of many interesting decision support processes. Most data mining
algorithms start by searching for frequent patterns. In association rule mining [1],
frequent itemsets are mined.

In this paper we present a new type of pattern, binary expressions, that
will be the basis of three applications. A binary expression is a conjunction of
binary tests between attributes. An example of such an expression using the test
$<$ is $(1 < 2) \wedge (2 < 3)$, expressing that attribute 1 is smaller than attribute 2 and
attribute 2 is smaller than attribute 3. A binary expression is called frequent
iff the number of tuples satisfying the expression is bigger than a given threshold.
Based on the properties of a specified binary test, such as reflexivity, transitivity
and symmetry, we construct a generic algorithm that searches for all frequent
binary expressions. The properties are used to avoid syntactically different but
semantically equal expressions. The following two different expressions

$(1 < 2) \wedge (2 < 3)$

$(1 < 2) \wedge (2 < 3) \wedge (1 < 3)$
will select exactly the same tuples. This is due to the fact that the binary test
$<$ is transitive. We will give a method to avoid generating both expressions.

In this paper we present three applications that have the mining of frequent
binary expressions in common. The first one is rule mining. A binary association
rule is a rule $X \rightarrow Y$, where $X$ and $Y$ are binary expressions. Just like in
association rule mining, we define notions of support and confidence for this type
of rule. The similarities with association rules will be elaborated in Section 3.

The second application is in making horizontal decompositions. Horizontal
decompositions have already been studied extensively [3] and are important in
the context of distributed databases. When we want to make a horizontal de- 
composition, it is important to find a good criterion to split the relation. We 
will use the frequent binary expressions to make an optimal decomposition of a 
relation, based on target sizes of the fragments. 

A third application is mining with intensional database relations [4]. In
Inductive Logic Programming (ILP), the mining base typically contains intensional
relations besides the traditional extensional relations. In this context, the
intensional relations can be viewed in the mining process as tests, in addition to the
traditional tests such as $<$, $=$, ..., and rules that contain intensional relations
can be mined in much the same way as rules over other tests.

The outline of the paper is as follows: in Section 2 binary expressions,
equivalence of expressions and some other notions are formally defined. In Section 3
the applications mentioned above are studied. In Section 4 we give some
properties of the search space of the algorithm. In Section 5 we present a generic
algorithm to find all frequent binary expressions. In Section 6 some experimental
results of the algorithm are given. These results show good scalability properties
of the algorithm. Section 7 concludes the paper.

An extended version of this paper is available as [2]. 

2 Definitions 

Before we elaborate the three applications given in the introduction, we formally
define the notions of a relation, a binary test, a binary expression,
and equivalence of expressions.

First we fix the relations we will consider. We only look at relations where all
attributes have the same domain $\mathcal{U}$.¹ $\mathcal{U}$ is a, possibly infinite, recursive set. We
use an unnamed perspective; i.e. we will refer to the attributes by their number.

Definition 1. An $n$-ary relation is a finite subset of $\mathcal{U}^n$.

A binary test² $\theta$ over $\mathcal{U}$ is a recursive subset of $\mathcal{U} \times \mathcal{U}$. When $(u_1, u_2) \in \theta$, we
will write $u_1 \theta u_2$.

We now define the notion of an expression.

¹ $\mathcal{U}$ stands for Universe.

² We use the name binary test instead of relation, to avoid confusion with database
relations. Actually, a binary test is just a relation in the mathematical sense.




Mining Frequent Binary Expressions 401 



Definition 2. Let $\theta$ be a binary test. A $(\theta, n)$-expression ($\theta$ and $n$ will be omitted
when clear from the context) is a conjunction of $(i \theta j)$'s, where $1 \leq i, j \leq n$. The
set of all $(\theta, n)$-expressions will be denoted by $\mathcal{E}(\theta, n)$.

The previous definition gave the syntax of an expression. The next definition 
gives the semantics of expressions. 

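Definition 3. Let $\mathcal{R}$ be an $n$-ary relation, and let $e$ be a $(\theta, n)$-expression.
$\sigma_e \mathcal{R} = \{r \in \mathcal{R} \mid (\forall (i \theta j) \text{ in } e)\ r(i)\, \theta\, r(j)\}$³ is the selection on $e$ of $\mathcal{R}$.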

We are now ready to state the problem of mining all frequent binary expres- 
sions. 

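Definition 4. Let $\mathcal{R}$ be an $n$-ary relation. The frequency of the binary expression
$e \in \mathcal{E}(\theta, n)$, denoted $\mathit{freq}(e, \mathcal{R})$, is $\frac{|\sigma_e \mathcal{R}|}{|\mathcal{R}|}$.

Let $t$ be a number between 0 and 1. A binary expression $e \in \mathcal{E}(\theta, n)$ is $t$-frequent
iff $\mathit{freq}(e, \mathcal{R}) > t$. ($t$ will be omitted if clear from the context.)

The solution of the $\mathit{freq}(\mathcal{R}, t, \theta)$-problem is the set of all $t$-frequent
$(\theta, n)$-expressions.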
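Definitions 3 and 4 can be illustrated with a minimal Python sketch. The relation `R` below is a hypothetical example (it is not the relation of Example 1), and the names `select` and `freq` as well as the encoding of an expression as a set of attribute pairs are our own assumptions.

```python
import operator

# An expression is encoded as a set of pairs (i, j), read as the
# conjunction of the tests (i θ j). Attributes are numbered from 1.

def select(relation, expr, test):
    """Selection: the tuples r with r(i) θ r(j) for every conjunct (i, j)."""
    return [r for r in relation
            if all(test(r[i - 1], r[j - 1]) for (i, j) in expr)]

def freq(relation, expr, test):
    """Fraction of the tuples of the relation that satisfy the expression."""
    return len(select(relation, expr, test)) / len(relation)

# Hypothetical 3-ary relation, for illustration only.
R = [(1, 2, 3), (2, 3, 1), (1, 3, 2), (3, 2, 1)]

e = {(1, 2), (2, 3)}                      # (1 < 2) ∧ (2 < 3)
print(freq(R, {(1, 2)}, operator.lt))     # 0.75
print(freq(R, e, operator.lt))            # 0.25
```

With a threshold of, say, 0.5, the single test 1 < 2 would be frequent on this relation while the conjunction would not.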

Example 1. Consider the following relation: 

The solution of the $\mathit{freq}(\mathcal{R}, |, <)$-problem is the set $\{1 < 2,\ 1 < 4,\ 3 < 2,\ 4 < 2,\
1 < 2 \wedge 1 < 4,\ 1 < 2 \wedge 3 < 2,\ 1 < 2 \wedge 4 < 2\}$.



3 Applications 

In this section we describe three applications of mining frequent binary expres- 
sions. 




3.1 Rule Discovery 



First we define a binary association rule. 



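Definition 5. A binary association rule is a rule $X \rightarrow Y$, where $X$ and $Y$ are
$(\theta, n)$-expressions.

The support of the rule $X \rightarrow Y$ is $\mathit{freq}(X \wedge Y, \mathcal{R})$.

The confidence of the rule $X \rightarrow Y$ is $\frac{\mathit{freq}(X \wedge Y, \mathcal{R})}{\mathit{freq}(X, \mathcal{R})}$.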



Example 2. Consider the relation given in Example 1. The support of the binary
association rule $1 < 4 \rightarrow 1 < 2$ is |. The confidence is 1.
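Support and confidence of a binary association rule can be sketched in a few lines of Python. This is only an illustration: the relation `R` is hypothetical (not the one of Example 1), the function names are our own, and confidence is computed as the usual ratio of the frequency of the whole rule to the frequency of its left-hand side.

```python
import operator

def freq(relation, expr, test):
    """Fraction of tuples satisfying every conjunct (i, j) of the expression."""
    hits = sum(all(test(r[i - 1], r[j - 1]) for (i, j) in expr)
               for r in relation)
    return hits / len(relation)

def support(relation, x, y, test):
    """Support of X -> Y: frequency of the conjunction X ∧ Y."""
    return freq(relation, x | y, test)

def confidence(relation, x, y, test):
    """Confidence of X -> Y; undefined (division by zero) if X never holds."""
    return freq(relation, x | y, test) / freq(relation, x, test)

# Hypothetical 3-ary relation, for illustration only.
R = [(1, 2, 3), (2, 3, 1), (1, 3, 2), (3, 2, 1)]
X = {(1, 2)}          # 1 < 2
Y = {(1, 3)}          # 1 < 3

print(support(R, X, Y, operator.lt))       # 0.5
print(confidence(R, X, Y, operator.lt))    # two thirds
```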

³ $r(i)$ denotes the $i$-th component of $r$; e.g. $(a, b, c)(2) = b$.
There are multiple similarities between association rules and binary association
rules. Both rules give frequent dependencies that hold within the tuples
themselves. Unlike for example roll-up dependencies [9], that describe relations 
between different tuples, association rules and binary association rules relate 
properties of attributes. In association rule mining, frequent itemsets can be 
considered as a conjunction of unary predicates. In this setting, binary asso- 
ciation rules are a straightforward extension of the unary predicates to binary 
predicates. A binary association rule finds associations between binary predi- 
cates, where association rules find associations between unary predicates. 

3.2 Horizontal Decompositions 

Horizontal decompositions are very important for distributed databases. In many 
cases it is desirable to fragment the database over different locations. In that case 
it is important to find good criteria to divide the database. We will call this a 
split-problem. The solution to a split-problem is an expression that selects a 
fraction of the tuples whose cardinality is as close to the given goal as possible. 

Example 3. Consider the relation given in Example 1. $3 < 2$ is a solution for
the split-problem where the goal is | and the binary test is $<$, since $|\sigma_{3<2} \mathcal{R}|$ is as
close to the goal as possible.
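A split-problem can be sketched as a search for the expression whose frequency is closest to the goal. The sketch below is an assumption-laden illustration: the relation `R`, the goal of 0.5, the restriction to single-test candidates, and the name `best_split` are all hypothetical choices, not taken from the paper.

```python
import operator

def freq(relation, expr, test):
    """Fraction of tuples satisfying every conjunct (i, j) of the expression."""
    hits = sum(all(test(r[i - 1], r[j - 1]) for (i, j) in expr)
               for r in relation)
    return hits / len(relation)

def best_split(relation, candidates, test, goal):
    """Return the candidate whose selected fraction is closest to the goal."""
    return min(candidates, key=lambda e: abs(freq(relation, e, test) - goal))

# Hypothetical 3-ary relation and candidate set, for illustration only.
R = [(1, 2, 3), (2, 3, 1), (1, 3, 2), (3, 2, 1)]
singles = [frozenset({(i, j)})
           for i in range(1, 4) for j in range(1, 4) if i != j]

best = best_split(R, singles, operator.lt, goal=0.5)
print(sorted(best), freq(R, best, operator.lt))   # [(1, 3)] 0.5
```

In practice the candidates would be the frequent expressions produced by the mining algorithm rather than an exhaustive list.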

3.3 Intensional Database Relations 

In Inductive Logic Programming (ILP) [4], mining conjunctions with intensional 
relations besides extensional relations is very common. The mining base used 
in Logic Programming typically contains a number of extensional relations and 
some intensional relations. The intensional relations are given by a set of de- 
scribing rules in a logic programming language, for example Prolog or Datalog. 
In the context of mining, the intensional relations can be viewed as tests, in 
addition to the traditional tests such as <,=,... 

Example 4. Suppose the following logic program is given:

Related(X,Y) :- Parent(X,Y);

Related(X,Z) :- Related(X,Y) & Related(Y,Z);

Related(X,Y) :- Related(Y,X);

Related(X,X);

From the last three rules we can conclude that the binary relation Related 
is transitive, symmetric, and reflexive. 

In this example the intensional relation Related is in fact a binary test. Using
this similarity, we can apply all results we obtain for mining binary expressions
to this case. Suppose for example that we have a predicate King and we
use the binary test Related. We could for example find that the expression
Related(X, Y) & King(X) & King(Y) is frequent. Because we know that Related
is symmetric, we know that testing Related(X, Y) & Related(Y, X) & King(X) &
King(Y) is redundant.
4 The Search Space

The $\mathit{freq}(\mathcal{R}, t, \theta)$-problem is essentially a search problem. We want to find all
frequent binary expressions in the search space $\mathcal{E}(\theta, n)$. For all binary tests $\theta$, the
number of expressions in $\mathcal{E}(\theta, n)$ is $2^{n^2}$, since the number of pairs of attributes is
$n^2$, and for every pair $(x, y)$, $x \theta y$ is present or absent. However, it is not always
necessary to consider all expressions. When there are equivalent expressions,
there is no need to consider them all.

Example 5. $1 = 2 \wedge 1 = 3$ is equivalent to $1 = 2 \wedge 2 = 3$.

In Tab. 1, for some binary tests and different numbers of attributes, the total
number of non-equivalent elements in the search space is given. For example, for
equality and 3 attributes, the search space is $\{1 = 1,\ 1 = 2,\ 1 = 3,\ 2 = 3,\ 1 = 2 = 3\}$.
Therefore, in Tab. 1, the row for $n = 3$ contains 5, the size of the search
space. The value of $2^{n^2}$ is also given for each value of $n$.

Table 1. Size of the search space for some binary tests

n     <      ≤             =      2^(n^2)
1     2      1      2      1      2
2     3      4      2      2      16
3     19     29     8      5      512
4     219    355    64     15     65536


To exploit the equivalence of expressions we need some properties of the ex- 
pressions to decide when two expressions are equivalent. Based on these proper- 
ties we will construct a mechanism to avoid generation of equivalent expressions. 

Definition 6. A binary test $\theta$ has property
$P_1$ = reflexive iff for all $1 \leq i \leq n$, $(i \theta i)$ holds.

$Q_1$ = anti-reflexive iff for all $1 \leq i \leq n$, $(i \theta i)$ does not hold.

$P_2$ = symmetric iff for all $1 \leq i, j \leq n$, if $(i \theta j)$ then also $(j \theta i)$ holds.

$Q_2$ = anti-symmetric iff for all $1 \leq i, j \leq n$, if $(i \theta j)$, then $(j \theta i)$ does not hold.
$P_3$ = transitive iff for all $1 \leq i, j, k \leq n$, if $(i \theta j)$ and $(j \theta k)$, then also $(i \theta k)$ holds.
$Q_3$ = anti-transitive iff for all $1 \leq i, j, k \leq n$, if $(i \theta j)$ and $(j \theta k)$, then $(i \theta k)$ does
not hold.

From Definition 6, it follows that for each $i$, either $P_i$ holds, or $Q_i$ holds, or neither. This
means that there are $3^3 = 27$ possible combinations. However, only 16 of them
really exist.

Definition 7. Let $\theta$ be a binary test, and let $P \subseteq \{P_1, P_2, P_3\}$ be the set of P-properties
of $\theta$. An expression $e \in \mathcal{E}(\theta, n)$ is closed iff every conjunct $(i \theta j)$ that
is necessary by the properties of $P$ appears in $e$.

Let $Q \subseteq \{Q_1, Q_2, Q_3\}$ be the set of Q-properties of $\theta$. An expression $e \in \mathcal{E}(\theta, n)$
is valid iff all conjuncts that are forbidden by the properties of $Q$ do not appear
in $e$.

Example 6. Clearly, $e = (1 < 2) \wedge (2 < 3)$ is not closed, since $1 < 3$ is necessary by
transitivity and does not appear in $e$. On the other hand, $e' = (1 < 2) \wedge (2 <
3) \wedge (1 < 3)$ is closed. $e$ and $e'$ are both valid expressions. $(1 < 2) \wedge (2 < 1)$ is
not valid, since anti-symmetry forbids $(2 < 1)$ when $(1 < 2)$ is present.

Lemma 1. Given a valid $(\theta, n)$-expression $e$, there is a unique valid and closed
expression $e'$ that is equivalent with $e$. $e'$ is obtained by augmenting $e$ with all
conjuncts that are necessary by the properties of $P$ of $\theta$. $e'$ is called the closure
of $e$.

In Example 6, $e'$ is the closure of $e$. It is clear now that in every equivalence
class of expressions there is a unique closed expression. Since we have to test only
one expression of each equivalence class, for solving the $\mathit{freq}(\mathcal{R}, t, \theta)$-problem, it
is sufficient to test each closed expression.
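The notions of closure and validity can be made concrete with a small Python sketch. It is an illustration under our own assumptions: expressions are encoded as sets of attribute pairs, validity is checked on the closure, and the helper names are hypothetical. As a sanity check, brute-force enumeration for $n = 3$ yields 5 closed expressions for $=$ and 19 closed and valid expressions for $<$.

```python
from itertools import chain, combinations

def closure(expr, n, reflexive=False, symmetric=False, transitive=False):
    """Add every conjunct made necessary by the given P-properties."""
    closed = set(expr)
    if reflexive:
        closed |= {(i, i) for i in range(1, n + 1)}
    while True:
        extra = set()
        if symmetric:
            extra |= {(j, i) for (i, j) in closed}
        if transitive:
            extra |= {(i, l) for (i, j) in closed
                      for (k, l) in closed if j == k}
        if extra <= closed:            # fixpoint reached
            return frozenset(closed)
        closed |= extra

def is_valid(expr, anti_reflexive=False, anti_symmetric=False):
    """Check that no conjunct forbidden by the given Q-properties appears."""
    if anti_reflexive and any(i == j for (i, j) in expr):
        return False
    if anti_symmetric and any((j, i) in expr for (i, j) in expr if i != j):
        return False
    return True

def closed_valid_expressions(n, p_props, q_props):
    """Enumerate all subsets of pairs and keep the distinct valid closures."""
    pairs = [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]
    subsets = chain.from_iterable(
        combinations(pairs, k) for k in range(len(pairs) + 1))
    closures = (closure(s, n, **p_props) for s in subsets)
    return {c for c in closures if is_valid(c, **q_props)}

EQ = dict(reflexive=True, symmetric=True, transitive=True)
LT_P = dict(transitive=True)
LT_Q = dict(anti_reflexive=True, anti_symmetric=True)

# Example 6: the closure of (1 < 2) ∧ (2 < 3) adds (1 < 3).
print(sorted(closure({(1, 2), (2, 3)}, 3, **LT_P)))   # [(1, 2), (1, 3), (2, 3)]
print(len(closed_valid_expressions(3, EQ, {})))       # 5
print(len(closed_valid_expressions(3, LT_P, LT_Q)))   # 19
```

The enumeration is exponential in $n^2$ and is only meant to check small cases; the algorithm of the next section avoids it.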

5 Algorithm 

In this section we describe an algorithm that finds all frequent binary expressions
given a binary test and a relation. Basically, the algorithm performs a levelwise
search as described in [6]. The levelwise algorithm is a generate-and-test
algorithm. It depends heavily on a monotonicity principle saying, roughly speaking,
that whenever $e_1$ is more specific than $e_2$, and the result of $e_2$ is too small, then
the result of $e_1$ is also too small. The next proposition states this monotonicity
principle.

Proposition 1. Let $e_1$ and $e_2$ be two expressions and $\mathcal{R}$ a relation. If $e_1$ is
more specific than $e_2$, then $|\sigma_{e_1} \mathcal{R}| \leq |\sigma_{e_2} \mathcal{R}|$.

Consider the following situation: we want to solve the $\mathit{freq}(\mathcal{R}, <)$-problem,
and we know that the expression $1 < 2$ is not frequent. Then, using Proposition
1, we know that $1 < 2 \wedge 1 < 3$ cannot be frequent, since $1 < 2 \wedge 1 < 3$ is more
specific than $1 < 2$. So, in this situation there is no need to count the frequency
of $1 < 2 \wedge 1 < 3$. We can prune the expression $1 < 2 \wedge 1 < 3$.

The search space of our algorithm will consist of all closed and valid expressions.
In Fig. 1 a part of the search space for $\mathit{freq}(\mathcal{R}, 3, <)$ is shown. When we
use the term children of an expression, we mean the expressions that are next
more specific in the lattice.

Our algorithm will try to prune as much of the search space as possible. We 
start with the most general expression of our search space, and we iteratively test 
more specific expressions, without ever evaluating those expressions that cannot 
be frequent given the information obtained in earlier iterations. More precisely, 
the search space is traversed level by level, from general to specific. In each 
        1<2            1<3            2<3

     (1<2)∧(1<3)      (1<3)∧(2<3)
               \        /
         (1<2)∧(2<3)∧(1<3)

Fig. 1. A part of the search space



iteration, the set candidates will contain the candidate frequent expressions. An
"apriori trick" is used: if the frequency of $e$ is below the threshold, and $e'$ is more
specific than $e$, then we know a priori that $e'$ must fail the frequency threshold. For this reason,
all expressions that failed the frequency threshold are stored in the set TooLow.
This gives us the framework of Fig. 2, which actually is a levelwise search [6].
Steps 3 to 7 test the candidates against the database and do the bookkeeping. In
step 8 the children of the frequent candidates are generated as the candidates for
the next iteration. In step 9, we use the apriori trick to prune away candidates
that cannot be frequent due to information obtained in previous iterations.



1.  candidates := {T}; Output := {}; TooLow := {}
2.  while (candidates ≠ {}) do
3.    Test: Test candidates against the database.
4.      fcan = {c ∈ candidates | c is frequent}
5.      nfcan = candidates − fcan
6.      Output := Output ∪ fcan
7.      TooLow := TooLow ∪ nfcan
8.    Generate: candidates := ∪_{p ∈ fcan} {c | c is a child of p}
9.    Prune: candidates := candidates − {c | ∃n ∈ TooLow : c is more specific than n}
10. end while



Fig. 2. Algorithm for finding frequent expressions 
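The framework of Fig. 2 can be sketched in Python for the test $<$. This is a simplified illustration, not the authors' implementation: the relation `R` and the threshold 0.4 are hypothetical, expressions are encoded as sets of attribute pairs, "more specific" is taken to mean "contains all conjuncts of", and children are generated naively by adding one conjunct and taking the closure (which may also produce non-immediate descendants; the result is still complete, only less efficient).

```python
import operator

def transitive_closure(expr):
    """Close a set of pairs under transitivity (the only P-property of <)."""
    closed = set(expr)
    while True:
        extra = {(i, l) for (i, j) in closed for (k, l) in closed if j == k}
        if extra <= closed:
            return frozenset(closed)
        closed |= extra

def is_valid(expr):
    """For <, a closed expression is valid iff it contains no pair (i, i)."""
    return all(i != j for (i, j) in expr)

def frequency(relation, expr, test):
    hits = sum(all(test(r[i - 1], r[j - 1]) for (i, j) in expr)
               for r in relation)
    return hits / len(relation)

def levelwise(relation, n, test, t):
    pairs = [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]
    candidates = {frozenset()}            # step 1: T, the empty conjunction
    output, too_low = set(), set()
    seen = set(candidates)                # bookkeeping to avoid duplicates
    while candidates:                     # step 2
        # steps 3-7: test the candidates and do the bookkeeping
        fcan = {c for c in candidates if frequency(relation, c, test) > t}
        too_low |= candidates - fcan
        output |= fcan
        # step 8: generate closed and valid extensions of frequent candidates
        generated = set()
        for p in fcan:
            for pair in pairs:
                child = transitive_closure(p | {pair})
                if child != p and is_valid(child) and child not in seen:
                    generated.add(child)
        seen |= generated
        # step 9: apriori pruning against the infrequent expressions
        candidates = {c for c in generated
                      if not any(low <= c for low in too_low)}
    return output

# Hypothetical 3-ary relation and threshold, for illustration only.
R = [(1, 2, 3), (2, 3, 1), (1, 3, 2), (3, 2, 1)]
result = levelwise(R, 3, operator.lt, 0.4)
print(len(result))    # 8, including the empty expression T
```

On this toy relation the search stops after the second level, because every three-conjunct candidate either was already generated or contains an infrequent expression.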



5.1 Testing 

In the test-phase, the frequencies of the candidates are tested against the data- 
base. The calculation of the frequency of an expression is very costly, since we 
need to iterate over all tuples in the relation to count the number of tuples that 
satisfy the expression. To limit the overhead, all candidates in an iteration are 
tested in the same run over the database. 

5.2 Generation 

In the generation phase, we need to generate all closed and valid children of 
the frequent candidates. As can be seen in Fig. 1, all children are generated by 
adding one conjunct, and taking the closure. However, not all conjuncts can be
used for generating children; in Fig. 1, the closure of $(1 < 2)$, augmented with
$(2 < 3)$, is $(1 < 2) \wedge (2 < 3) \wedge (1 < 3)$, and this is not a child of $(1 < 2)$, since
$(1 < 2) \wedge (1 < 3)$ lies between them. In the generation phase this problem is
handled.

In the framework of the algorithm, in the generation phase, all children of the 
frequent candidates are generated. It is however sufficient that every expression 
only generates a subset of its children, as long as for every expression there is 
still at least one generating parent. We only generate those children that are 
induced by a sublattice of the search space. 

Not generating all children does no harm; still all expressions are generated. 
On the other hand, not generating all children has a couple of advantages. 

— In step 8 of the algorithm, all expressions generated by frequent candidates
are added as new candidates. Many duplicates are likely to be generated,
and these duplicates need to be removed. The fewer children are generated, the
fewer duplicates need to be removed.

— By not generating all children, some early pruning is applied. We will discuss 
this in more detail in the subsection on pruning. 

From this discussion we can conclude that ideally each expression has exactly 
one generating parent. This is the case when the spanning sublattice is a tree. 

The generation phase is discussed in more depth in [2], where we introduce
two strategies for the generation, $\rho_1$ and $\rho_2$.



5.3 Pruning 

A basic operation of the algorithm is pruning. It is essential that this operation
is performed as efficiently as possible. Pruning implies that for every
expression $e$ that is generated in step 8 of the algorithm, we need to investigate
whether there is an expression $l$ in TooLow such that $e$ is more specific than $l$. If this is the case,
we can prune $e$.

From previous research, we can conclude that a trie is a good structure to
store sequences. A trie uses common prefixes between the sequences to store
them more efficiently. We do not go into detail on tries here; for more elaborate
work on tries, we refer to [5] and [7].
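One possible way to use a trie for the pruning test is sketched below in Python. This is our own assumption about how such a structure could look, not the data structure of [5] or [7]: each infrequent expression is stored as the sorted sequence of its conjuncts, and a query asks whether some stored expression is contained in a candidate. The class name `SetTrie` and its methods are hypothetical.

```python
class SetTrie:
    """Trie over sorted conjunct sequences, supporting subset queries."""

    def __init__(self):
        self.children = {}
        self.terminal = False     # True if a stored expression ends here

    def insert(self, expr):
        node = self
        for conjunct in sorted(expr):
            node = node.children.setdefault(conjunct, SetTrie())
        node.terminal = True

    def contains_subset_of(self, expr):
        """Is some stored expression a subset of expr?"""
        return self._search(sorted(expr), 0)

    def _search(self, items, start):
        if self.terminal:
            return True
        # A stored subset appears as a subsequence of the sorted candidate.
        for k in range(start, len(items)):
            child = self.children.get(items[k])
            if child is not None and child._search(items, k + 1):
                return True
        return False

too_low = SetTrie()
too_low.insert({(2, 3)})                     # 2 < 3 was found infrequent

print(too_low.contains_subset_of({(1, 2), (1, 3), (2, 3)}))   # True: prune
print(too_low.contains_subset_of({(1, 2), (1, 3)}))           # False: keep
```

Expressions that share a prefix of conjuncts share a path in the trie, so a candidate is compared against many stored expressions at once.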

Step 9 is not the only step in which pruning occurs. When all generating
parents of an expression are infrequent, the expression will never be generated,
even when there are other parents that are frequent. Thus, the fewer parents a
node has, the bigger the chance that it will never be generated if some of its
parents are infrequent. This type of pruning is called early pruning. When an
expression is not pruned early, it can still be pruned in step 9 of the algorithm.
This situation occurs when at least one generating parent is frequent, and at least
one other parent is infrequent. Due to the monotonicity principle the expression
will be pruned.
6 Experimental Results

In this section we present some experimental results. We implemented $\rho_1$ and
$\rho_2$. For a definition and a discussion of these two algorithms, we refer to [2]. The
source code of both implementations can be obtained at
http://cc-www.uia.ac.be/u/calders/.

6.1 Effectiveness of Pruning 

In Fig. 3 (left), a lower bound for the total number of closed and valid expressions
in $\mathcal{E}(<, n)$ is given for $n = 2, 4, \ldots, 20$ for reference. In Fig. 3 (right), some
tests on a randomized dataset are given for an increasing number of attributes.
The number of expressions that are examined by our algorithms is given for a
threshold of 0.3. Note that the scale of the
graph representing the total size of the search space is logarithmic. The number
of expressions examined by the algorithms in this example is exponentially smaller
than the total number of elements in the search space.

Fig. 3. The size of the search space (left) versus the number of expressions that were
investigated (right)



6.2 Scalability 

In Fig. 4, the running time of the two algorithms is measured. When the number
of attributes grows, refinement operator $\rho_2$ becomes much more efficient than
$\rho_1$. In the left graph, a threshold of 0.4 was used, and the binary test was $<$.
In the right graph, the binary test $=$ was used, and the test was done with the
refinement operator $\rho_2$, with a threshold of 0.5. In both graphs, the dataset was
randomly generated. The number of values in $\mathcal{U}$ was 2 in the right graph, and 7
in the left one.

Fig. 4. Scalability in the number of attributes

7 Conclusion

Binary expressions are an interesting type of pattern for data mining. In this
paper we presented three applications of frequent binary expressions: binary
rules, which essentially are extensions of association rules to binary predicates;
horizontal decompositions; and the mining of intensional database relations. We
presented and tested an algorithm for finding frequent binary expressions. The
algorithm exploits background information such as reflexivity, transitivity and
symmetry of the binary tests to optimize the search.
Abstract. Text classification is becoming more important with the proliferation
of the Internet and the huge amount of data it transfers. We
present an efficient algorithm for text classification using hierarchical
classifiers based on a concept hierarchy. The simple TFIDF classifier is
chosen to train sample data and to classify other new data. Despite its
simplicity, results of experiments on Web pages and TV closed captions
demonstrate high classification accuracy. Application of feature subset
selection techniques improves the performance. Our algorithm is computationally
efficient, being bounded by $O(n \log n)$ for $n$ samples.



1 Introduction 

As the amount of on-line data increases by leaps and bounds, the design of an
efficient algorithm or an approach to accessing the data (e.g. through classification,
clustering, filtering, etc.) has become of great interest. Two important
aspects motivate such design. First, the data needs to be arranged efficiently. For
example, instead of placing all the data in a flat directory, we can arrange it hierarchically
based on a concept hierarchy (see Yahoo, US Patent databases, CNN
and other major Internet news directories [13,9,2]). Querying with respect to a
concept hierarchy is significantly more efficient and reliable than searching for
specific keywords since the views of the data collected are refined as we go down
the hierarchy [1]. Second, when text classification is our chosen approach, an
efficient algorithm should be used. A number of algorithms have been proposed
and their performances are compared in the literature [14].

We use the TFIDF text classifier [11,4] and proceed with the following steps 
for hierarchical classification: The first step is to define the concept hierarchy 
using domain knowledge and to collect text data corresponding to the concept 
hierarchy. The data is then used for training the classifier and for testing the 
performance of our classification system. The next step is to convert the data into 
an appropriate form for classification (e.g. into a bag-of-words representation). 
Then we can derive hierarchical classifiers by supervised learning with the data 
collected. Finally, the classifiers are used to classify new data. Each of these steps 

Y. Kambayashi, M. Mohania, and A M. Tjoa (Eds.): DaWaK 2000, LNCS 1874, pp. 409-418, 2000. 
is described in detail in the following sections, followed by experimental results
and a conclusion.



2 Concept Hierarchy 



[Figure 1: tree diagram with levels labeled depth 1 to depth 5. One level contains the teams Dodgers, Yankees, Spurs, and Blazers; below each team are the nodes Recaps, News, and Players (depth 4); the lowest level (depth 5) contains the players Brown, Mondesi, Jeter, Clemens, Duncan, Robinson, Rider, and Wallace.]

Fig. 1. A sample concept hierarchy for professional baseball and basketball.



Figure 1 shows a concept hierarchy which categorizes Web news reports for
professional baseball and basketball. Initially, a concept graph is generated based
on some domain knowledge. Each node in the hierarchy contains several text
documents whose topic is identified as the concept.

As in relational and object-oriented databases, in which there is no absolutely
standard schema (tables and classes), we do not have a faultless concept
hierarchy. In fact, we should not look for one that is exclusively superior to any
other even for the same data. We need one that can help us alleviate our
human semantic burden (since a concept hierarchy encapsulates semantics) and
facilitate efficiency in data arrangement and searching capability.

However, unlike relational databases and object-oriented databases, Web documents
are generally unstructured. Therefore, the way we describe their schema
is different from the relational and object-oriented models. Because they are unstructured,
a convenient way to describe a concept in our hierarchy is to use a
collection of words (features) from a document. Since it is usually very easy for
humans to come up with some top-level concepts as the schema, we propose a
human-generated initial concept hierarchy.

Before using this concept hierarchy to perform feature extraction and classification,
there are several assumptions to be made. First, we assume that an
initial concept hierarchy has been created with each node labeled by one or a
few terms representing the concept. Second, we assume that several documents
have been manually placed into every node, serving as the training and testing
data for supervised learning. Our third assumption states that a parent node
owns the union of the documents of its child nodes.
3 Training the Hierarchy

After the initial hierarchy is set up and training and testing documents are
placed into each node, we can proceed to "train" the hierarchy. The rationale is
as follows: we characterize the training documents residing in each node and
find a range (or threshold) of the various characteristics for the node so that it
can classify the test documents into certain concepts; that is, find suitable nodes
to index the new documents, from top to bottom, in the concept hierarchy.

Ideally, we would hope to characterize documents in each node with a "term"
or a label. However, this is not practical because for text documents, one feature
term will not suffice to describe the whole document. Instead, we must "understand"
the documents in some way. In the absence of a satisfactory solution to
the natural language understanding problem, most current approaches to document
retrieval use a bag-of-words representation of documents [11,4]. It is for
this reason that we opt for surface parsing and obtain a vector (a set of word features)
of weights for each document with time complexity in the order of $O(n)$, where
$n$ is the number of words in the document.

Specifically, we restrict ourselves to a relatively simple yet effective approach
based on the TFIDF (term frequency × inverse document frequency) classifier
[11,4]. Since we are trying to separate documents into distinguishable concepts,
intuitively, the combined effect of term frequency and inverse document
frequency can distinguish two different document types well. Definitions of term
frequency and inverse document frequency follow.

3.1 Representing Features for Each Node

To convert a collection of documents in each node into a special representation,
let us examine the documents closely. First, every document is processed using
stopping and stemming procedures [11,4] to obtain a bag of words. Stopping is
the procedure to eliminate common words from the text, and stemming is the
procedure to find a unique representation (e.g. root) for a word. After these
procedures, we consider the following frequencies in the documents.

— Term Frequency. Let $W$ be the set of words from all documents. The term
frequency of the word $w_i$ (the $i$th vocabulary word in $W$), $TF(w_i, d)$, is the number
of times $w_i$ occurs in document $d$.

— Document Frequency. $DF(w_i)$ is the number of documents in which word $w_i$
occurs at least once.

— Inverse Document Frequency. $IDF(w_i) = \log\left(\frac{|D|}{DF(w_i)}\right)$, where $|D|$ is the total
number of documents among sibling nodes.

— Term Frequency × Inverse Document Frequency. $TFIDF(w_i, d) = TF(w_i, d) \times
IDF(w_i)$.
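The four quantities above can be computed with a few lines of Python. The sketch assumes the common form $IDF(w) = \log(|D| / DF(w))$, documents that have already been stopped and stemmed, and a tiny hypothetical corpus; the function names are our own.

```python
import math
from collections import Counter

def idf_table(docs):
    """IDF(w) = log(|D| / DF(w)), with DF counted over the given documents."""
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(n_docs / df[w]) for w in df}

def tfidf(doc, idf):
    """Sparse TFIDF vector of one document: TF(w, d) * IDF(w)."""
    tf = Counter(doc)
    return {w: tf[w] * idf[w] for w in tf}

# Hypothetical bag-of-words documents, for illustration only.
docs = [
    ["ball", "bat", "ball"],
    ["ball", "hoop"],
    ["hoop", "dunk"],
]

idf = idf_table(docs)
vector = tfidf(docs[0], idf)
print(vector)    # "ball" gets 2 * log(3/2), "bat" gets 1 * log(3)
```

A word that occurs in every document gets an IDF of zero, so it contributes nothing to the vector, which matches the intent of down-weighting uninformative words.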

We subsequently merge all child feature vectors to obtain the TF vector for
a node. Calculation of TF vectors continues bottom-up until we reach the root.
Meanwhile, we also propagate the DF vector for each node all the way up to the
root. When this merging process is completed, the feature vector in each node is
given by $F = \langle TFIDF(w_1, d), TFIDF(w_2, d), \ldots, TFIDF(w_{|W|}, d) \rangle$, where
$d$ is the union of documents belonging to the node, and $W$ is the union of the set
of words in $d$.



3.2 Determining the Threshold for Each Node 

Having organized training documents into different nodes (classes) and having 
built feature vectors for each of them, we characterize each class by a TFIDF 
vector. In other words, TFIDF will now serve as a norm or prototype vector to 
describe that class. 

Formally, let C be a collection of document classes of interest. A prototype 
vector c (for each class in the concept hierarchy) is generated for each class c E C 
as follows: c = ^ ■ 

We will use these training documents again to make a complete hierarchical
classifier in two ways. First, for each prototype vector c, we need to introduce a
threshold $\theta$, a distance measure to indicate at what distance range to the prototype
vector we consider documents to fall into the same category (i.e. class). This becomes
clear if we imagine that a TFIDF vector is lying in an $n$-dimensional hyper-space¹, as
shown in Figure 2. As can be seen, the threshold makes a boundary for a class. For
simplicity, we visualize it after normalization (i.e. as a unit-length vector) in a 3-D
coordinate system.

Second, we need to examine how good the hierarchical classifier is. That is, given the
prototype vectors and thresholds, what is the accuracy with which the classifier will
correctly classify documents? Later, we will check the accuracy for other brand new
test documents. However, in this section, we only consider the accuracy for the
training documents which we used to come up with the prototype vectors.

Deriving the threshold and computing training accuracy are inter-related. Here
we use classification training accuracy feedback to adjust the threshold. In other
words, we adjust the threshold in such a way that all documents considered
for the node will yield the best accuracy. Accuracy, as shown in the following,
concerns the percentage of documents correctly classified into a class:




Fig. 2. Learning the threshold 
to admit new documents into a 
class. 



$$\text{Accuracy} = \frac{\#\text{ of documents correctly categorized}}{\#\text{ of documents considered}} = \frac{a + d}{a + b + c + d}$$



^ Elements of the TFIDF vector are positive real numbers, so they can only lie in the
first quadrant.
In the equation, we use a, b, c, d to indicate respectively the number of
documents that should be in a class and are selected; the number selected
but that should not be in; the number that should be in but are rejected; and
the number that should not be in and are not selected.

Now, starting from the highest level of the hierarchy, we compare the
training documents n (in vector representation) with the prototype vector
of the node. When we compare two feature vectors, a cosine function is
commonly used as the similarity measure. That is, we compute the similarity
of a document n to a prototype vector c by:

$$\cos(n, c) = \frac{n \cdot c}{\|n\| \, \|c\|}, \quad \forall c \in C$$

    for (level = 1 to depth of hierarchy)
        for (every node j in level)
            Fn := getPrototypeVec(Node_j);
            Docs := getDocuments(getSiblings(Node_j) ∪ getChildren(Node_j));
            <Fd> := getFeatures(Docs);
            <Dist> := cos(Fn, <Fd>);
            θ := Find a cut d in <Dist> that maximizes accuracy;
        end
    end

Fig. 3. Using accuracy feedback to adjust the threshold.

For every node, we consider those training documents belonging to all sibling
and child nodes. We first sort them by their distances (Dist) to the prototype
vector. We then choose the distance that renders the best accuracy. This is the
threshold, θ. Pseudocode for computing θ is shown in Figure 3. (All the functions
are self-explanatory.)
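
The threshold search described above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function name `best_threshold`, the boolean `labels` input, and the convention that a document is accepted when its cosine similarity is greater than or equal to θ are all assumptions made for the example.

```python
from typing import Sequence, Tuple


def best_threshold(sims: Sequence[float], labels: Sequence[bool]) -> Tuple[float, float]:
    """Return (theta, accuracy) for the similarity cut with the best accuracy.

    sims[i]   : cosine similarity of training document i to the prototype
    labels[i] : True if document i really belongs to the node's class
    A document is accepted when its similarity is >= theta (assumption).
    """
    if not sims:
        raise ValueError("at least one training document is required")
    total = len(sims)
    negatives = sum(1 for is_member in labels if not is_member)
    # Sorting dominates the cost: O(n log n).
    order = sorted(range(total), key=lambda i: sims[i], reverse=True)

    # Cut above every document: nothing is accepted, so a = 0 and d = negatives.
    best_theta, best_acc = float("inf"), negatives / total
    a = 0  # should be in the class and selected
    b = 0  # selected but should not be in the class
    for rank, i in enumerate(order):
        if labels[i]:
            a += 1
        else:
            b += 1
        # Only place a cut between two distinct similarity values.
        if rank + 1 < total and sims[order[rank + 1]] == sims[i]:
            continue
        d = negatives - b  # should not be in the class and not selected
        acc = (a + d) / total  # (a + d) / (a + b + c + d)
        if acc > best_acc:
            best_theta, best_acc = sims[i], acc
    return best_theta, best_acc
```

After the sort, a single linear pass is enough because moving the cut one document lower only changes a or b by one.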

3.3 Time Complexity Analysis 

Our approach involves surface parsing, i.e., obtaining features for words in the
documents as well as training features in the concept hierarchy. Parsing takes
linear time. Most of the time is taken up by training, especially by computing
the thresholds. Assume that n is the total number of documents, m is the
total number of nodes in the hierarchy, and m << n. The contributing factors
timewise are: computing TF takes O(n); computing IDF takes O(n); computing
the threshold θ takes O(n log n), which is the cost of sorting.
Altogether, the time remains bounded by O(n log n), which supports our claim
that this classification algorithm is fast.

4 Testing the Hierarchy 

In this section, we test the performance of our hierarchical classifier using a test 
set of documents which is different from the training set. The ratio of the total 
number of testing documents to the training documents is about 1 to 2. 

4.1 Every Node as a Classifier 

When each node is equipped with a TFIDF vector and a threshold (θ), we regard
the node as a classifier. Again, starting from the highest level of the hierarchy,
we compare the test document n (in vector representation) with the prototype
vector of the node. If these two vectors are close enough, based on the cosine
measure, we treat the test document as belonging to the class that the node
represents, and we continue to compare the document with its child nodes. If we
do not have a close match, we conclude that the document does not belong to
the class and stop.

In our hierarchical classifier, a document is assigned to all the classes
whose prototypes are sufficiently close to it. The pseudocode in Figure 4
summarizes the classification process.
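
A minimal Python sketch of the top-down procedure of Figure 4 is given here for illustration. The `Node` class, the sparse dictionary representation of TFIDF vectors, and the function names are assumptions made for the example and do not come from the paper.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List

Vector = Dict[str, float]  # sparse TFIDF vector: word -> weight


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


@dataclass
class Node:
    """Hypothetical hierarchy node: a prototype vector plus its threshold."""
    name: str
    prototype: Vector
    theta: float
    children: List["Node"] = field(default_factory=list)


def classify(doc: Vector, root: Node) -> List[str]:
    """Return the names of all classes whose prototype is close enough to doc."""
    classes: List[str] = []
    pending = [root]
    while pending:
        node = pending.pop()
        if cosine(doc, node.prototype) > node.theta:
            classes.append(node.name)
            # Only the children of an accepted node are examined further.
            pending.extend(node.children)
    return classes
```

Because the children of a rejected node are never visited, a document that fails at a high level costs only a few comparisons.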

4.2 Test Accuracy 

Test accuracy is defined as the ratio of the number of correctly classified
documents to the number of documents that are filtered at each node, i.e.,

$$\text{test accuracy} = \frac{\#\text{ of correctly classified}}{\#\text{ of filtered}}$$

Since the classification is per- 
formed from the top to the bottom 
of the hierarchy, some test documents 
“filter” through certain nodes and 
continue to drill down in the hierar- 
chy; others are stopped when their 
similarity measures do not pass the threshold of the prototype vectors. Con- 
sequently, the number of filtered documents diminishes as the testing process 
continues to the bottom level. 

5 Experiments 

For training and testing, we collected approximately 200 documents on pro- 
fessional baseball and basketball news. Experimental results demonstrate the 
feasibility of our approach with good classification accuracy. In training, most 
upper level nodes (classifiers) receive ratings of above 90% for training accuracy, 
while some lower level nodes perform on the average at a rate of 75%. In testing, 
the accuracy performance is relatively poorer, especially in lower nodes but ac- 
ceptable in higher level nodes. This is because the concepts become more specific 
as we go down the hierarchy. In addition, the number of documents considered 
is relatively small compared to those at higher levels. 

To account for the classification error, both in training and testing phases, 
we look into two different types of errors: false-positive (FP) and false-negative 
(FN). They are defined as follows (using a, b, c, and d as explained before): 



    Let n be the vector to be classified.
    Class = ∅
    NodeSet = { root }
    while (NodeSet ≠ ∅) do
        Retrieve a node_i from NodeSet:
            NodeSet = NodeSet − { node_i }.
        Compute the prototype vector:
            c = TFIDF(node_i).
        Compute the similarity between n and c:
            s = cos(n, c).
        if (s > θ)
            Class = Class ∪ { c }.
            NodeSet = NodeSet ∪ { getChildren(node_i) }.
        end
    end
    return Class



Fig. 4. Finding the class for a new text 
document. 





A Fast Algorithm for Hierarchical Text Classification 415 



false-positive(FP) 
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false-negative{FN) = 
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errors for training and testing in the hierarchy. In the testing phase, a few lower 
level nodes resulted in relatively high FN. This is due to the depth of the nodes 
and the limited number of documents tested in those nodes. 
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(a) Training phase. (b) Testing phase. 

Fig. 5. False-positive (fp) and false-negative (fn) errors (in percents). 



5.1 Feature Subset Selection 

The above experiment has shown that the TFIDF feature is a reliable indicator
for categorizing Web text. We now investigate whether a subset of these features
can perform as well in text classification.



Straight TFIDF Subset One feature selection method represents each node with only
a subset of word features. The new feature vector for each node consists of the
top m leading TFIDFs after sorting the original vector by its TFIDF values:
$F_m = \langle TFIDF(w_1, d), TFIDF(w_2, d), \ldots, TFIDF(w_m, d) \rangle$, where $m \le |w|$.
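
As an illustration, keeping only the top m TFIDF values of a node can be sketched in a few lines of Python. The dictionary representation and the name `top_m_features` are assumptions made for the example.

```python
from typing import Dict


def top_m_features(vector: Dict[str, float], m: int) -> Dict[str, float]:
    """Keep the m entries with the largest TFIDF weights."""
    ranked = sorted(vector.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:m])
```

For example, cutting a vector by 50% corresponds to calling `top_m_features(vector, len(vector) // 2)`.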

Figure 6 illustrates the average accuracy (in percent) for classifiers at each
depth for both training and testing phases. In testing, 10 different subsets are
compared. According to our findings, cutting the TFIDFs down by 40-50%
scarcely affects the test accuracy. This has two implications: First, not only is
TFIDF a good indicator in text classification, but the higher TFIDF values
are the dominating terms. Second, using subset features can reduce TFIDF
storage requirements drastically and improve efficiency without compromising
classification accuracy.



Special Positive/Negative Vectors Another way of selection is to introduce 
two special vectors (one positive, one negative) to represent the background 
knowledge. The positive and negative vectors are manually made. They are the 
human version of TFIDFs because most positive terms may coincide with the 
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Fig. 6. Average accuracy for classifiers at each depth of the hierarchy. 
Fig. 7. Combining background knowledge in the classification.



dominating TFIDFs, whereas negative terms are words that do not contribute
to classification by human judgement.

Figure 7 (a) shows how this background knowledge is incorporated into the
original TFIDFs for every node in the hierarchy. During training, the negative
vector is subtracted from the original TFIDF vector. Later, in testing, we impose
a confidence function and the positive vector on each node, and let the
confidence function compete with the original similarity function. When the
background knowledge confidence reaches a certain level, we abandon the cosine
measure and determine the category right away.

In this experiment, a very simple confidence function is used: the number
of positive words (δ) that occur in the document. Such background knowledge
somewhat mimics the way a human determines a category. In Figure 7 (b),
for example, when a δ of 4 is used, the results show that combining such
background knowledge with TFIDFs does improve the classification accuracy.
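
The interplay between the confidence function and the cosine measure can be sketched as follows. This is a hedged illustration only: the helper names, the clipping of subtracted weights at zero, and the default δ of 4 (taken from the example in Figure 7 (b)) are assumptions, not details confirmed by the paper.

```python
import math
from typing import Dict, Set

Vector = Dict[str, float]  # sparse TFIDF vector: word -> weight


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def remove_negative_terms(prototype: Vector, negative: Vector) -> Vector:
    """Training: subtract the negative vector, dropping weights that reach zero."""
    result: Vector = {}
    for term, weight in prototype.items():
        reduced = weight - negative.get(term, 0.0)
        if reduced > 0.0:
            result[term] = reduced
    return result


def confidence(doc: Vector, positive_words: Set[str]) -> int:
    """Number of distinct positive words that occur in the document."""
    return sum(1 for term in positive_words if doc.get(term, 0.0) > 0.0)


def belongs_to_class(doc: Vector, prototype: Vector, theta: float,
                     positive_words: Set[str], delta_min: int = 4) -> bool:
    """Testing: high confidence decides at once, otherwise use the cosine test."""
    if confidence(doc, positive_words) >= delta_min:
        return True
    return cosine(doc, prototype) > theta
```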



5.2 Applied to TV Closed Captions 

We applied the same approach to TV closed caption data mixed with a few Web
pages (15%). Closed captions usually contain more typos than Web pages. Even
so, the classification accuracy showed some promising results (Figure 8). This
is important because success in classifying TV closed captions can assist in
video (or, more generally, multimedia) classification, where time is of great
concern.
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(a) Concepts for TV closed caption data. 
Fig. 8. Classifying TV closed captions with TFIDF and background knowledge (BK).



6 Related Work 

A number of existing approaches are similar to hierarchical classification based
on a pre-defined concept hierarchy. Many of them are combined with feature
subset selection, which finds the best subset of features to improve
classification accuracy and to reduce measurement cost, storage, and
computational overhead. One example, TAPER [1], makes use of a concept hierarchy
and classifies text using statistical pattern recognition techniques. It finds
feature subsets using Fisher’s discriminant. Similarly, Mladenic and Grobelnik
proposed a document
categorization method based on a concept hierarchy [8]. They used the naive 
Bayesian classifier on feature vector of word sequences and employed feature 
subsets to yield good performance. McCallum et al. also proposed a hierarchical 
classification using the naive Bayesian classifier [5]. In particular, they suggested 
combining labeled and unlabeled data to boost the classification accuracy. 

There have been numerous approaches for automatic generation of a concept 
hierarchy. Their incentive is to eliminate the overhead of manually constructing 
a concept hierarchy. Sahami, for example, applied unsupervised clustering to 
generate a concept hierarchy from text data [10]. He made use of well-defined
similarity measures to find the clusters as well as to perform feature subset
selection.
Sanderson and Croft presented a means of automatically deriving a hierarchical 
organization of concepts from a set of documents [12]. Their work used co- 
occurrence and subsumption conditions for selected salient words and phrases 
without standard learning or clustering techniques. In addition, there exist a 
variety of learning algorithms for text. Yang compared the performance of many 
learning algorithms [14]. Mladenic surveyed text-learning and related intelligent 
agents based on three key criteria: what representations to use for documents; 
how to select features, and what learning algorithm to use. Mitchell’s book [6] 
is yet another comprehensive source of machine learning algorithms. 

7 Summary and Discussion 

The design of an intelligent text classifier is of great importance in the current 
world filled with such vast amounts of data. The algorithm must be fast because 
time is critical. To meet these requirements, we designed and implemented a
hierarchical classification system that leverages a concept hierarchy and a
simple and fast learning algorithm. The hierarchical aspect narrows down the
search space significantly by eliminating all irrelevant areas. Then, the use of
TFIDF and accuracy feedback quickly builds the classifier.

Regarding experimental results, the hierarchical classifiers performed fairly
well with the Web data and TV closed captions in the sports domain. Our
preliminary work on feature subset selection also demonstrated improvements
in classification accuracy and in the cost associated with the use of features.
Some avenues for future research include:

— Exploitation of data structures: Structural information in the text (e.g., title, 
sections, references) can be exploited and differentiated. 

— Consideration of different types of data: The system can be tested with 
different types of text data - for instance, U.S. patent data. 

— Incremental learning: New data can be learned dynamically. Meanwhile, old 
data should be ignored after some time has elapsed. 

— Combination of labeled and unlabeled data: To reduce the overhead in data 
preparation prior to learning, unlabeled data can be combined with labeled 
data. 

— Automatic expansion/shrinkage of the concept hierarchy: Dynamic change in 
the concept hierarchy is needed to accommodate the newly formed concepts. 
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1 Introduction 

Recent rapid growth in the ability to generate and store data, driven by more
powerful Database Management Systems and hardware architectures, leads to a
question: how can we take advantage of this large amount of information?
Traditional methods for querying and reporting are inadequate because they can
only manipulate data, and the information content derived is very low. Obtaining
new relationships among data and new hypotheses about them is the aim of
Knowledge Discovery in Databases (KDD), which makes use of Data Mining
techniques. These techniques have interesting applications for business data
such as market basket analysis, financial resource planning, fraud detection and
the scheduling of production processes. In this work we consider the application
of Data Mining techniques to the analysis of the balance-sheets of Italian
companies.

Currently there is great interest in data mining for financial markets, and
large amounts of data for this domain are available. Analysis of these data for
the purpose of abstracting and understanding market behaviour is being
intensively explored. These analyses are also useful for making predictions
about future market evolution, and several companies use analytical data mining
methods for actual investment portfolio management.

A number of Data Mining techniques, e.g. rule induction, neural networks,
and conceptual clustering, have been developed and used individually in domains
ranging from space data analysis to financial analysis. Frequently, a single data
mining technique is insufficient for extracting knowledge from a data set. Instead,
several techniques must be employed cooperatively to support a single Data
Mining application. In this paper we propose a hybrid Data Mining approach
for the classification of balance-sheets. We show that by combining different
classification techniques, such as induction of decision trees and clustering, it is
possible to generate an appropriate partition and an intuitive description of data.
We have implemented a system, called DMTool, based on the combined use of
different Data Mining and classification techniques, for the analysis of
balance-sheet data. Several experiments have confirmed the validity of our approach.

* Work partially supported by a MURST grant under the projects “Data-X” and
“Piano Telematico Calabria” and by the EC project “Contact”

Y. Kambayashi, M. Mohania, and A M. Tjoa (Eds.): DaWaK 2000, LNCS 1874, pp. 419-424, 2000. 

© Springer-Verlag Berlin Heidelberg 2000 




420 G. Dattilo et al. 



2 Balance-sheet Classification 

In this section we present an application which combines different classification 
techniques, such as induction of decision trees and Bayesian clustering, for the 
analysis of the balance-sheets of the Italian companies. Although we have con- 
sidered several trades of the Italian economy, in this paper we only report results 
concerning the “Gum” trade. 

Data Structure. We have used a database containing the balance-sheets of all
Italian companies. The most interesting data are stored in a table containing
520,000^ tuples per year; each tuple contains 27 attributes (e.g. assets,
turnover, profit, debt) describing the balance-sheet composition. We first
processed the data to eliminate redundant information and noise, and selected
the data used in our experiments as follows: 1) we retained only data for the
year 1998 (due to the presence of incomplete information for previous years);
2) we considered only balance-sheets closed on December 31st (for reasons of
uniformity); 3) we analyzed different trades separately, since intrinsic
differences in their behaviour would make any global analysis unreliable.

Moreover, we have also normalized data. More specifically, real values are 
scaled by a 10® factor whereas percentage values are truncated. 

As mentioned above, in the experiments described in this paper, we focused
the analysis on the “Gum” trade, which includes enterprises producing tyres,
plastic, etc. This trade contains around 3,000 companies and, therefore, there are
around 3,000 tuples whose Asset attribute (the one used by the domain expert
to define the classes) spans a large range of values, from very low amounts to
very high ones.

2.1 Combining Data Mining techniques 

For the analysis of the balance-sheets we first used standard classification
techniques, with the support of an expert for the definition of classes. Next, we
combined induction of decision trees and Bayesian clustering to derive the class
definitions automatically, without the support of the domain expert.

The criterion suggested by the expert for the definition of classes consists in
partitioning the domain of the attribute Assets as follows: Class 1: Assets < 5^,
Class 2: 5 < Assets < 10, Class 3: 10 < Assets < 22, Class 4: 22 < Assets < 
38, Class 5: 38 < Assets < 63, Class 6: 63 < Assets < 140 and Class 7: 
Assets > 140. 
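
For illustration, the expert partition can be expressed as a small lookup in Python. The name `expert_class` is hypothetical, and because the extracted text does not show which side of each boundary is inclusive, the sketch assumes that a boundary value belongs to the upper class (for example, Assets = 5 falls in Class 2).

```python
from bisect import bisect_right

# Upper bounds of Classes 1 to 6, in the scaled units used in the text.
EXPERT_BOUNDS = [5, 10, 22, 38, 63, 140]


def expert_class(assets: float) -> int:
    """Return the class number (1 to 7) for a scaled Assets value.

    Assumption: each boundary value is assigned to the upper class.
    """
    return bisect_right(EXPERT_BOUNDS, assets) + 1
```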

However, as we shall see, the description of the above classes is not
satisfactory, since more than 12% of the tuples are not classified correctly.
Thus, we use Bayesian clustering to improve the partition suggested by the
domain expert. This technique is based on the selection of the most relevant
attributes (as suggested by the domain expert) and the application of the
clustering technique in
^ The number of Italian companies. 

^ Lit values scaled by a 10® factor 
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the following way: 1) generate a set of clusters by using the values of previously 
selected attributes; 2) memorize, for each cluster denoting a class, the associated 
tuples and delete selected attributes. 

Clearly, if we consider all attributes we derive a set of disjoint classes
(every tuple belongs to a unique class). However, in this case it is generally
not possible to characterize the classes in terms of intervals over the
attribute domains. Therefore, we prefer to derive a partition into classes that
is less accurate but easily comprehensible, and we introduce a technique which
represents a good trade-off between the two.

For each class $C_j$, each attribute $A_i$ has an associated range $R_{ij}$. Since two
ranges $R_{ij}$ and $R_{ik}$ may overlap, we select the attribute which minimizes overlaps
as follows: 1) to each tuple we add the label provided by the clustering partition
function; 2) we choose an attribute A' to characterize the clusters. In particular,
we discretize A' by means of a supervised discretization technique that uses the
label assigned by the clustering model as target; for each tuple we then use
the discretized value obtained in the previous step as class label and remove
the actual value of A'.

For choosing such an attribute we proceed as follows. For each attribute k
and cluster $C_i$, the clustering algorithm gives the mean value ($\mu_{i,k}$) and
the standard deviation ($\sigma_{i,k}$) of k within $C_i$. We define $R_{i,k}$ as:

$$R_{i,k} = [\mu_{i,k} - l \cdot \sigma_{i,k},\; \mu_{i,k} + l \cdot \sigma_{i,k}]$$

where the value of $l$ depends on the effective distribution of the attribute values
within the class. We define an overlapping index as:

$$O_{i,j,k} = \frac{card(I_{i,j,k})}{card(C_i) + card(C_j)}$$

where $I_{i,j,k} = \{X \in C_i \cup C_j \mid \pi_k(X) \in R_{i,k} \cap R_{j,k}\}$ and card(S) is the cardinality
of the set S. We choose as target the attribute that minimizes the function:

$$R_k = \sum_{i} \sum_{j > i} O_{i,j,k}$$
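
The selection of the target attribute can be sketched in Python as below. This is an illustrative sketch under stated assumptions: tuples are represented as dictionaries, the range of an attribute within a cluster is taken as the mean plus or minus l population standard deviations with l = 1 by default, and all function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, float]  # one balance-sheet tuple: attribute name -> value


def attr_range(cluster: Sequence[Row], attr: str, l: float = 1.0) -> Tuple[float, float]:
    """Range of attr within a cluster: [mean - l * std, mean + l * std]."""
    values = [row[attr] for row in cluster]
    mu, sigma = mean(values), pstdev(values)
    return mu - l * sigma, mu + l * sigma


def overlap_index(ci: Sequence[Row], cj: Sequence[Row], attr: str, l: float = 1.0) -> float:
    """Fraction of tuples of both clusters whose value lies in both ranges."""
    lo_i, hi_i = attr_range(ci, attr, l)
    lo_j, hi_j = attr_range(cj, attr, l)
    lo, hi = max(lo_i, lo_j), min(hi_i, hi_j)
    if lo > hi:  # the two ranges do not intersect
        return 0.0
    inside = sum(1 for row in list(ci) + list(cj) if lo <= row[attr] <= hi)
    return inside / (len(ci) + len(cj))


def best_attribute(clusters: List[Sequence[Row]], attrs: Sequence[str], l: float = 1.0) -> str:
    """Attribute with the smallest sum of overlap indexes over all cluster pairs."""
    def total_overlap(attr: str) -> float:
        return sum(overlap_index(ci, cj, attr, l) for ci, cj in combinations(clusters, 2))
    return min(attrs, key=total_overlap)
```

An attribute whose per-cluster ranges never intersect gets a total of zero, so it is the natural candidate for describing the clusters by intervals.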

In our experiments the attribute that satisfies the above condition is Assets 
that induces the following partition: Assets < 20, 20 < Assets < 30, 30 < 
Assets < 40, 40 < Assets < 55, 55 < Assets < 90, 90 < Assets < 185 and 
Assets > 185. 

For reasons of uniformity we choose a partition into 7 classes characterized by 
a good accuracy, among the various ones obtained by clustering-based methods. 
This partition allows us to compare “homogeneous” experiments carried out on 
different class definition methods. Once we have defined a semantics for classes 
we use a classifier to build a model for data, containing a function for assigning 
objects to classes. Observe that the attribute used by the expert was the same 
attribute found by the clustering technique. 
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2.2 Experimental Results 

We performed several experiments on different trades. Here we only report some 
results regarding the “Gum” trade. 

For experimental purposes we used a training set having 1000 tuples. We 
first used the class partition suggested by a domain expert (an economic expert 
on balance-sheets). The results for this experiment are shown in Figure 1. 
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Fig. 1. Confusion matrix for domain expert partition 



Numbers off the diagonal correspond to misclassified tuples, namely tuples
whose effective class (database class) differs from the predicted class
(found by the algorithm). For instance, in the second row we can note
that “Class 2” (see previous section) is composed of 144 companies, of which 122
are correctly classified while the other 22 are assigned to neighboring classes. A
quantitative expression for misclassification is the generalization error (i.e. the
fraction of misclassified tuples with respect to the total number of tuples), which
in this case is equal to 12.6%.
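
The generalization error can be read directly off a confusion matrix, as the following Python sketch shows. The function name is hypothetical, and any matrix passed to it in an example would be made-up data rather than the matrix of Figure 1.

```python
from typing import Sequence


def generalization_error(confusion: Sequence[Sequence[int]]) -> float:
    """Fraction of tuples lying off the diagonal of a square confusion matrix.

    confusion[i][j] is the number of tuples of effective class i that were
    predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total
```

Applied to the “Class 2” row alone, the same computation gives 22 / 144, i.e. about 15.3% of that class is misclassified.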
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Fig. 2. Confusion matrix for clustering class partition 



The same experiment, using the class partition described in the previous
subsection, gave the results reported in Figure 2. The generalization error is
now 7.7%, less than in the previous experiment. After many experiments we
observed that generalization errors obtained on clustering-based partitions are
always lower than the ones obtained on domain expert partitions.
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3 A system for balance-sheet classification 

In this section we describe the architecture and main functional aspects of a 
system, called DMTool, which we implemented to support the whole process of 
classification and data description of the balance-sheets of Italian companies. 

DMTool provides an integrated environment for the management and classifi- 
cation of data as well as analysis of results. The system is built in a modular way, 
integrating several existing algorithms and packages. It allows non-expert users to
process data and build a model making use of data manipulation tools as well
as clustering and decision tree induction algorithms. The system architecture is
shown in Figure 3.
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Fig. 3. System Architecture 



Users interact with the system by means of a graphical user interface (GUI) 
that allows easy manipulation of original data in the source database. The sys- 
tem manages several archives: the source database, the target database, which 
contains data manipulated for a particular task, and a third database that stores 
models induced from data. In the last archive we store important information 
about model description and classification functions. For each model we store 
its representation, the estimated accuracy and information relative to the pro- 
cess which built the model (training-set, class definition, induction algorithm 
parameter). The main modules of the system are the following. 

Target data management module This module is devoted to the selection of a
subset of original data (Target data) for inducing a classification model. The main
functions available include: tuple selection by two different criteria (random or by
user query), projection of interesting attributes and discretization of continuous
numerical attributes.

Class management module This module offers easy and flexible functions for 
class definition. The main goal consists in defining a target attribute that will 
characterize object classes. We allow different ways to perform this operation so 
that users can choose the best method for the particular analysis they want to 
perform. In particular, we allow users to choose as target a discrete attribute, 
or to discretize a continuous one by using one of the several criteria available. 

At the end of the classification process a report for misclassification is pro- 
vided so that the user can verify the validity of the method chosen. 

Data Mining Modules The main Data Mining component is the induction mod- 
ule. It permits the generation of a new classification model starting from the 
training set built by the user as described previously. The core of induction is an 
algorithm for induction of decision trees. The test module allows us to evaluate
the prediction capabilities of an existing model applying it to a set of labeled 
test data, selected by the user. The result of this operation is a global measure 
of model accuracy and the confusion matrix. The confusion matrix permits us 
to estimate misclassifications and can suggest, if necessary, a more appropriate 
class definition for data, so the user can tune the method. 

Visualization of data and extracted knowledge This module allows the graphical 
visualization of models as decision trees or decision rules ordered by class. 
Abstract. Since 1995, we have been developing the web search engine “Mondou”
using data mining techniques (http://www.kuamp.kyoto-u.ac.jp/labs/infocom/mondou/index_e.html)
in order to discover helpful information for web search operations. In our
previous work, we focused on the computing cost of deriving associative keywords
and proposed a method to determine system parameters, such as the Minsup and
Minconf threshold values. Moreover, we evaluated the ROC performance of keywords
derived by weighted association algorithms. In this paper, we implement two
kinds of Java applets in our Mondou system: an ROC graph for selecting
associative keywords, and document clustering. This visual interface shows the
characteristics of association rules on the ROC graph together with the Minsup
values. It also provides a document clustering function in order to visualize
retrieved documents.



1 Introduction 

In the research fields of data mining [10, 2] and text/web mining [1], various
mining algorithms have been proposed to discover interesting patterns, rules,
trends and representations in databases. Most of the discovered patterns or rules
are rather simple as knowledge for experts, but they are sometimes helpful
for beginners, who do not have background or domain knowledge. Therefore, we
extended the algorithm of association rules [14], and we implemented our
proposed algorithm in our web search engine “Mondou”
(http://www.kuamp.kyoto-u.ac.jp/labs/infocom/mondou/). The Mondou system provides
associative keywords to search users by means of extended association
algorithms [9].

When we developed our web mining engine, we had several problems in deriving
appropriate associative keywords, since the quality of derived keywords strongly
depends on the minimum support threshold (Minsup) and the minimum
confidence threshold (Minconf). Therefore, in our previous work [4], we proposed
a way to adjust the Minsup value dynamically.

In this paper, based on our previous work [5, 6], we introduce a method which
specifies the optimal thresholds based on ROC (Receiver Operating
Characteristic) analysis [11]. In Section 2, we give a simple sketch of the
performance evaluation of our proposed algorithm, which is implemented in the
Mondou system. In Section 3, we present the implementation of the visual
interface for the Mondou system. This interface is based on the characteristics
of the ROC graph and the threshold values. Furthermore, we show the clustering
interface of the Mondou system. Finally, we make concluding remarks and discuss
future work in Section 4.



2 ROC Evaluation of associative keywords 

In this section, we show how to estimate the performance of the keywords derived
by our Mondou search engine. We define the following parameters:



Definition: 

G: The set of keywords included in a query
n: The number of keywords in G
k_i: The i-th keyword in G (1 ≤ i ≤ n)
K_i: The set of documents covered by k_i
B: The set of documents covered by G
m: The number of keywords derived from item set B
r_j: The j-th keyword derived from item set B (1 ≤ j ≤ m)
R_j: The set of documents covered by r_j

Moreover, we use the following operators: ∪ is the set operator of union, ∩ is
the set operator of intersection, and | | is the set operator that counts the
number of items.

Figure 1 shows the coverage by B and R_j in the universal set U, which
means all database items in the system. The set of items covered by all keywords
in G is $B = \bigcap_{i=1}^{n} K_i$.

Next, in order to evaluate the performance of the derived keywords r_j, we
introduce ROC analysis [11]^. An instance can be classified into two classes:
the positive class P or the negative class N, and a classifier assigns it the
label positive y (yes) or negative n (no). We define p(c | I) as the posterior
probability that instance I belongs to class c. The true positive rate TP of a
classifier is given by the following equation:

$$TP = p(y \mid P) = \frac{\text{positives correctly classified}}{\text{total positives}} \qquad (1)$$

The false positive rate FP of a classifier is as follows:

$$FP = p(y \mid N) = \frac{\text{negatives incorrectly classified}}{\text{total negatives}} \qquad (2)$$



^ ROC analysis has been used in signal detection theory to depict tradeoffs
between the hit rate and the false alarm rate. ROC graphs illustrate the behavior
of a classifier without regard to class distribution or error cost; they decouple
classification performance from these factors. Moreover, the ROC convex hull
method is an effective way to compare multiple classifiers; it specifies the
optimal classifier with the highest performance.
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Fig. 1. Status of database items cov- 
ered by keywords. 
Fig. 2. The ROC graph of Minsup.



Therefore, in our retrieved set, the positive instance is $B$, which decreases the
number of documents, and the negative instance is $\bar{B}$ (the complement of $B$
in $U$), which increases it. Thus the true positive instance is
$B \cap \bigcup_{j} R_j$ and the false positive instance is
$\bar{B} \cap \bigcup_{j} R_j$. TP and FP are represented by the following equations:

$$TP = \frac{|B \cap \bigcup_{j} R_j|}{|B|}, \qquad FP = \frac{|\bar{B} \cap \bigcup_{j} R_j|}{|\bar{B}|}$$

In Figure 1, the true positive instance is $B \cap R_2$ and the false positive instance
is $\bar{B} \cap R_2$.
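
As a rough illustration (not part of the Mondou implementation), the TP and FP ratios can be computed directly from sets of document identifiers. The function name `roc_point` and the toy sets below are assumptions made for this sketch; it also assumes that neither the base set nor its complement is empty.

```python
def roc_point(universe, base, derived_sets):
    """Return (FP, TP) for a base set B and the document sets R_j of derived keywords.

    Hypothetical helper: assumes `base` and its complement in `universe` are non-empty.
    """
    covered = set().union(*derived_sets)   # union of all R_j
    complement = universe - base           # documents outside B
    tp = len(base & covered) / len(base)
    fp = len(complement & covered) / len(complement)
    return fp, tp


# Toy data (invented for illustration): ten documents, B holds four of them.
U = set(range(10))
B = {0, 1, 2, 3}
R2 = {2, 3, 4}
fp, tp = roc_point(U, B, [R2])
print(fp, tp)
```

Here two of the four documents in B are covered by R2, and one of the six documents outside B is covered.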

Furthermore, we need to control the value of Minsup according to the frequency
with which the keywords appear in the database. Therefore, we divide the keywords
into several classes by frequency, and for each class we plot the averages of
(FP,TP) derived from specific keywords in an ROC graph.

For instance, in order to evaluate the performance of our proposed algorithm,
we analyzed the 330,562 documents ($|U|$) in the Mondou system and divided the
keywords into ten categories. Figure 2 shows the ROC graph of the averages of
(FP,TP) for each category and each Minsup value, together with the boundary of the
convex hull. We can adjust the Minsup value dynamically based on the characteristics
shown in Figure 2. Furthermore, if we draw the distribution of associative keywords
with a few attributes, it should help to determine the combination of keywords in a
modified query.
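
The convex hull boundary mentioned above can be sketched as the upper hull of the plotted (FP,TP) points, anchored at (0,0) and (1,1). This is a generic monotone-chain construction, not the authors' code; the function names and the sample points are assumptions for illustration only.

```python
def _cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Return the upper convex hull of (FP, TP) points, from (0, 0) to (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the segment hull[-2] -> p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


# Invented (FP, TP) averages, one per hypothetical Minsup setting.
sample = [(0.1, 0.6), (0.3, 0.5), (0.4, 0.9), (0.7, 0.8)]
hull = roc_convex_hull(sample)
print(hull)
```

Points that do not appear in the returned list lie under the hull and are dominated by some combination of the hull points.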



3 Visual Interface of Mondou 

At present, we are constructing a web-based information navigation interface
using CGI and the “Open Text” database. Figure 3 shows the query window of the
Mondou system described in [8, 3]. In addition to these primitive interfaces, we
are implementing two kinds of Java applets in the Mondou system.
Fig. 5. Clustering results by Mondou.

According to the discussion in our previous paper [7] and in the previous section,
the farther a point lies from the point (1,0) in the original ROC graph, the higher
the performance of the corresponding keywords. Therefore, we visualize this distance
with a Java applet. By using an ROC graph with (FP, TP, Minsup), we can choose
effective keywords much more easily. Thus, in order to select keywords, we
implemented a 3D ROC graph with (FP,TP) and Minsup values, which is shown in
Fig. 4. Moreover, in this applet we also implemented several functions for rotating,
magnifying and fisheye viewing [12] in order to inspect the ROC graphs.
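
The distance criterion can be illustrated with a few lines of Python. This is only a sketch of the ranking idea; the function name and the (FP,TP) values attached to the keywords are invented for the example and are not measurements from Mondou.

```python
import math


def distance_from_worst(fp, tp):
    """Euclidean distance of an ROC point from (1, 0), the worst corner."""
    return math.hypot(fp - 1.0, tp)


# Invented (FP, TP) pairs for three candidate keywords.
candidates = {
    "information": (0.2, 0.8),
    "data": (0.5, 0.5),
    "personal": (0.9, 0.3),
}

# Rank keywords so that the one farthest from (1, 0) comes first.
ranked = sorted(candidates, key=lambda k: distance_from_worst(*candidates[k]), reverse=True)
print(ranked)
```

The ideal point (0,1) attains the maximum possible distance, the square root of 2.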

For example, when we execute a query with the keyword “mediator” in Fig. 3, the
Mondou system derives several associative keywords, such as {information, data,
distributed, supporting, connections, personal}, with appropriate Minsup values,
as shown in Fig. 4. Using this 3D ROC graph, we can understand the effectiveness of
the associative keywords for making clusters of retrieved documents. After selecting
keywords in Fig. 4, we execute the document clustering function with the button on
the left side.

The visualization of document clusters in Fig. 5, which is based on the VIBE
system [13], makes it easy to understand the items in the target clusters, including
appropriate documents, without complicated query expressions over several keywords.
For instance, when you select a number in Fig. 5, the items in the cluster are shown
automatically in the right frame; these results satisfy the “AND” expression of the
keywords “mediator, distributed, information, data”.

In summary, the former visualization applet is very powerful for selecting
suitable keywords by ROC graph and support values. The latter visual interface
is effective for clustering retrieved documents by combinations of selected
keywords. These visual interfaces are very helpful tools for information navigation
in Mondou with mining association rules.

4 Conclusion 

Information visualization techniques are becoming widely used in various areas of
data mining research. However, it is hard to determine which methods are effective.
In this paper, we first evaluate the performance of rules based on the
characteristics of ROC graphs. Then, considering the ROC convex hull method, we
select appropriate combinations of keywords with suitable threshold values in the
browser. Moreover, we make clusters of the retrieved documents. In the future, in
order to provide much easier query operations, we have to develop more
sophisticated visual applets in our Mondou system.
Vmhist: Efficient Multidimensional Histograms with Improved Accuracy

Pedro Furtado and Henrique Madeira 
Abstract: Data warehouses must be able to process and analyze large amounts of
information quickly and efficiently. Small summaries provide a very efficient way to
obtain fast approximate answers to complex queries that would otherwise run for too
long. This paper proposes an efficient hierarchical partitioning strategy, vmhist,
that achieves a large improvement in the accuracy of the summary while maintaining
scalability. This is achieved by pre-computation, localized updating and additivity
of the error measures used in the partitioning process. Evaluation reveals that
summaries produced with vmhist are significantly more accurate, without a
significant increase in histogram construction time.



1. Introduction 

Multidimensional histograms provide a very practical way to produce small
queryable summaries of the joint distribution of values of multidimensional data sets
such as those used in OLAP. The most important metrics for evaluating histogram
construction algorithms are the accuracy of the summary when answering queries and
the scalability of the algorithm (the algorithm must be able to reduce very large
data sets). In practice, scalability and accuracy tend to conflict, as the most
accurate summary construction algorithms are frequently not scalable.

The mhist algorithm was proposed in [3] for the construction of
multidimensional histograms for selectivity estimation and in [5] for approximate
OLAP query answering. This algorithm constructs histograms by using the marginal
distributions to determine the partitioning into buckets instead of doing a more
time-consuming analysis (the marginal distributions are small compared to the data
set size). The space partitioning metric used in [3] was based on the determination
of the largest differences between neighboring values in the marginal distribution
(mhist-Maxdiff). It is well known that a heuristic based on the homogeneity of the
resulting buckets, such as the V-Optimal variance-based constraint [4], would produce
higher-quality histograms, but this heuristic would imply exponential construction
time and therefore scalability would be lost, rendering the approach unusable [3,4].
Recently, this problem was dealt with for the V-Optimal constraint on
uni-dimensional histograms for data sets with a reasonably small cardinality of
values [1]. The problem of finding a scalable high-quality solution for
high-cardinality multidimensional data, which would provide an important
improvement to the mhist algorithm, remained unsolved.

The main contribution of this paper is to solve this problem by proposing a
histogram construction strategy, vmhist, that is able to maintain scalability while
using the variance of candidate buckets to probe the homogeneity of regions, and
therefore produces a high-quality multidimensional histogram without extra
performance cost. The algorithm uses pre-computation of values, additivity of error
measures and localized updating to achieve very accurate summaries while
maintaining scalability.

The paper is organized as follows: section 2 introduces hierarchical
partitioning and section 3 presents the vmhist technique. Section 4 presents
experimental results and section 5 concludes the paper.



2. Hierarchical Partitioning 



In order to produce a histogram summary of a data set, a space partitioning strategy
must be followed. Adaptable partitioning refers to strategies that try to determine the
best possible partitioning of the space into regions [2]. This section describes the
basic hierarchical partitioning algorithm for multidimensional spaces. Given a data set
with point values $p(a_1, \ldots, a_n, v_1, \ldots, v_j, \ldots, v_m)$, the marginal data
distribution of attribute $a_i$ for the value attribute $v_j$ is the set
$marg_{a_i}(x_l, \sum_{\text{all database tuples with } a_i = x_l} v_j)$, where $x_l$
represents each individual value taken by attribute $a_i$. The marginal
distribution is used in the analysis that determines the partitioning instead of the
whole data, with obvious performance gains. The algorithm successively divides one
or more buckets into two by the end-to-end splitting line that gives the best error
measure results.

Figure 1 shows a two-dimensional space with the arrays of marginal distributions
(sum of values or occurrences in the row or column) which are analyzed to determine
the splitting lines.

[Figure: a two-dimensional space divided by a splitting line into regions R1 and R2, with marginal arrays col freq. (R1), col freq. (R2), row freq. (R1) and row freq. (R2)]

Figure 1 - Hierarchical Partitioning and Marginal Distribution Arrays

Alternative partitioning constraints can be used to determine the best splitting line
from the marginal distributions. For Maxdiff histograms (mhist-Maxdiff) [3], the
splitting line with the largest difference in source values between adjacent values
is used. For V-Optimal (variance optimal) histograms [4], the splitting line with
maximum variance of source parameter values is used.

A naive implementation of a variance-based constraint in multidimensional data
sets incurs exponential histogram construction time due to the need to re-compute
the measure in each iteration using the individual values of the partitioned regions.

The next section proposes the vmhist technique, which uses pre-computed additive 
parameters and localized updates to render variance-based constraints useful in the 
multidimensional data set context. 








3. Variance-Based Hierarchical Partitioning (vmhist)

The technique vmhist proposes a new scalable strategy to implement complex 
variance-based heuristics within the multidimensional partitioning context. The key 
concepts of this algorithm are: 

• Pre-computation - almost all the parameters that are needed in the iterative 
partitioning process are pre-computed in extended marginal distribution arrays; 

• Additivity - the parameters that are pre-computed and used in the computation of 
the homogeneity of regions are additive. This way, it is possible to compute the 
homogeneity without accessing all the values from the regions. Instead, the pre- 
computed parameters are used to derive a variance measure. Furthermore, only the 
region that was split needs an update of the homogeneity measure; 

• Localized updates - only a very small set of the pre-computed values, 
corresponding to the region(s) that were split, have to be updated in each iteration. 

The algorithm vmhist applies the policies described above to the basic hierarchical
partitioning strategy. In particular, the variance can be computed additively for any
region $B_i$ from the values $n_{B_i}$ (number of values), $ls_{B_i}$ (linear sum of
values) and $ss_{B_i}$ (square sum of values), as shown next:

$$\text{Variance} = \frac{\sum_{i=1}^{n} (v_i - \mu)^2}{n}$$

Additivity: $V_{B_i}(n_{B_i}, ls_{B_i}, ss_{B_i}) = \left(ss_{B_i} - ls_{B_i}^2 / n_{B_i}\right) / n_{B_i}$



3.1. Pre-computing and Localized Updating of Partitioning Parameters in vmhist 

The split decision in mhist-Maxdiff [3] is taken locally (from neighboring marginal
distribution values). In contrast, the variance-based approach proposed here probes
the variance resulting from all possible hierarchical splits in all dimensions to 
determine the split that produces the highest overall reduction in variance in each 
iteration. After the split, only the variances of the two resulting buckets need to be 
recalculated and this is done efficiently using pre-computed values. This means that 
vmhist uses the variance measure without the need to re-compute the values of the 
partitioning constraint except locally. This approach is only possible because the 
marginal distribution contains all the values that are needed to calculate the variance 
without accessing the source table. 

As we have shown, the variance is easily expressed as a function of the additive
parameters $n_{B_i}$, $ls_{B_i}$ and $ss_{B_i}$. It can be computed for the union of
any number of regions using these parameters, which must be available for each region:

$$V_{B_i \cup B_j \cup \ldots \cup B_l} = V(n_{B_i} + \ldots + n_{B_l},\; ls_{B_i} + \ldots + ls_{B_l},\; ss_{B_i} + \ldots + ss_{B_l})$$
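
The additivity property can be checked with a small sketch. The class name `Stats` and its methods are hypothetical names introduced here; the sketch assumes the population variance (division by n) and is not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    """Additive parameters of a region: count, linear sum and square sum."""
    n: int
    ls: float
    ss: float

    @classmethod
    def of(cls, values):
        # Summarize raw values once; afterwards the raw values are not needed.
        return cls(len(values), sum(values), sum(v * v for v in values))

    def __add__(self, other):
        # Union of two disjoint regions: the parameters simply add up.
        return Stats(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def variance(self):
        # Population variance derived from the additive parameters only.
        return (self.ss - self.ls ** 2 / self.n) / self.n


b1 = Stats.of([1, 2, 3])
b2 = Stats.of([4, 5, 6, 7])
merged = b1 + b2
print(merged.variance())
```

The variance of the union is obtained without revisiting the seven underlying values, which is the property the partitioning relies on.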

This way, the pre-computed quantities that must be kept to speed up the
computation of variance are simply the additive parameters for each position of the
marginal distributions, which have the structure:
dimension   // identifies the dimension attribute
bucket      // identifies the region
ai          // value taken by the dimension attribute
counti      // number of occurrences
lsi         // sum of values
ssi         // square sum of values
mvi         // the splitting condition to be maximized



The statement that creates the marginal distributions becomes: 

SELECT count(di), sum(vj), sum(vj * vj)
FROM sourcetable
GROUP BY di;
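
A statement of this shape can be tried out with Python's built-in `sqlite3` module. The table layout, column names (`d1`, `d2`, `v`) and rows below are invented for the illustration; the grouping column is also selected so that each row of the marginal distribution can be identified.

```python
import sqlite3

# In-memory database with a tiny, made-up source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sourcetable (d1 TEXT, d2 TEXT, v REAL)")
conn.executemany(
    "INSERT INTO sourcetable VALUES (?, ?, ?)",
    [("a", "x", 1.0), ("a", "y", 3.0), ("b", "x", 2.0), ("b", "y", 2.0), ("b", "x", 5.0)],
)

# Extended marginal distribution of dimension d1: count, linear sum, square sum.
marginal_d1 = conn.execute(
    "SELECT d1, count(*), sum(v), sum(v * v) "
    "FROM sourcetable GROUP BY d1 ORDER BY d1"
).fetchall()
print(marginal_d1)
conn.close()
```

Running the same query with `GROUP BY d2` would give the extended marginal distribution of the second dimension.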

Figure 2 illustrates the extended marginal distribution with the parameters 
described above and also additional cumulative sums that are discussed next. 

[Figure: the marginal distribution array with its LEFT cumulative sums and RIGHT cumulative sums]

Figure 2 - Histograms of Cumulative Sums

The pre-computation and localized updating strategy is refined further in
vmhist to make the computation of the variance constraint extremely fast. The
variances of the buckets formed before and after each candidate splitting line can be
computed at any moment from pre-computed cumulative sums of the parameters that
are used to derive the variance additively. The structure used to store the
pre-computed parameters is a cumulative histogram that holds the sums of ls, ss and
count, as shown in Figure 2. The variance and the cumulative values are both
pre-computed at the beginning and only updated locally in each iteration. The
LEFT/RIGHT histograms are pre-computed or updated by summing the values from
the marginal distributions to the left or right of each value. For instance, the
following statement updates the LEFT histogram entry for the value ai:

SELECT sum(countj), sum(lsj), sum(ssj), (sum(ssj) - sum(lsj) * sum(lsj) / sum(countj))
FROM marginal_distribution
WHERE aj <= ai;
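
The LEFT and RIGHT cumulative histograms can be sketched as running sums over the marginal distribution. The function names and the sample triples are assumptions for this example; each entry is a (count, ls, ss) triple ordered by attribute value.

```python
from itertools import accumulate


def add(x, y):
    """Component-wise sum of two (count, ls, ss) triples."""
    return tuple(a + b for a, b in zip(x, y))


def cumulative(marginal):
    """Return (left, right) cumulative histograms of the marginal distribution.

    left[i] sums positions 0..i; right[i] sums positions i..end.
    """
    left = list(accumulate(marginal, add))
    right = list(accumulate(reversed(marginal), add))[::-1]
    return left, right


# Made-up marginal distribution with three attribute values.
marginal = [(2, 4.0, 10.0), (3, 9.0, 33.0), (1, 6.0, 36.0)]
left, right = cumulative(marginal)
print(left)
print(right)
```

For a splitting line placed after position i, the two candidate buckets are described by `left[i]` and `right[i + 1]`, so their variances follow from the additive formula without touching the source data.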
3.2. The Variance-based Partitioning Constraint of vmhist

In vmhist, a simple variance-based partitioning constraint is used, which achieves
very good accuracy. The algorithm determines the best splitting lines as those
with the largest decrease in variance between the region and the candidate subregions
for each possible splitting line. The mv parameter of the marginal distribution arrays
is pre-computed and updated locally. The partitioning constraint maximizes the
difference of variances between the bucket and the two buckets resulting from a
possible partitioning:

Variance(B) - (Variance(B1) + Variance(B2))

(mv = B.n × B.var - left.n × left.var - right.n × right.var)

The best splitting line(s) are computed by maximization of the mv parameter.
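
A minimal sketch of the split selection along one dimension follows, using the count-weighted form of mv given in parentheses above. The function name `best_split` and the sample data are assumptions; the sketch expects at least two positions, each with a non-zero count.

```python
def best_split(marginal):
    """Pick the splitting line that maximizes mv.

    marginal: list of (count, ls, ss) triples ordered by attribute value.
    Returns (index, mv), meaning "split after position index".
    """
    def n_var(n, ls, ss):
        return ss - ls * ls / n            # count times variance

    total = [sum(col) for col in zip(*marginal)]
    whole = n_var(*total)
    best = None
    left = [0, 0.0, 0.0]
    for i, cell in enumerate(marginal[:-1]):
        left = [l + c for l, c in zip(left, cell)]        # running LEFT sums
        right = [t - l for t, l in zip(total, left)]      # RIGHT = total - LEFT
        mv = whole - n_var(*left) - n_var(*right)
        if best is None or mv > best[1]:
            best = (i, mv)
    return best


# Raw values per position would be [1, 1], [2] and [9, 10] (invented data).
marginal = [(2, 2.0, 2.0), (1, 2.0, 4.0), (2, 19.0, 181.0)]
index, mv = best_split(marginal)
print(index, mv)
```

The chosen line separates the low values (1, 1, 2) from the high values (9, 10), which is the split that leaves the two most homogeneous buckets.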



4. Experimental Results 

The main issues in the comparative evaluation of histogram construction
algorithms are the construction time cost and the accuracy of the resulting summary.
The algorithms must be scalable, as they are intended to summarize very large data
sets. Figure 3 shows the construction time cost of three strategies: mhist-Maxdiff [3]
(the basic implementation of mhist using the Maxdiff partitioning constraint); the
vmhist implementation of Maxdiff (vmhist-Maxdiff), which also pre-computes the
partitioning constraint; and the variance variant of vmhist (vmhist-Variance). The
results show that the pre-computation and localized update features of vmhist result
in faster execution than the basic mhist implementation, which needs to re-compute
the partitioning constraint in each iteration. The results also show that
vmhist-Variance is practically as fast as vmhist using the Maxdiff constraint. In
fact, it can be seen from the figure that the vmhist strategy is more scalable than
mhist-Maxdiff.

Figure 4 shows comparative accuracy results for mhist-Maxdiff and vmhist,
considering range-sum aggregation queries on a Sales data set and summaries with
5% of the original data set size. The aggregation categories are shown on the x-axis,
together with the average number of values per group result. It is important to
note that the accuracy is very good for any summary when large ranges are
considered in all dimensions of the data set, but much larger errors result when
queries detail some dimensions further or select fewer values. The errors of the
group results are measured relative to the average value returned by the query. The
figure shows that vmhist always produced much better estimates than mhist.

5. Conclusions and Future Work

This paper presents a new hierarchical partitioning summary construction
technique. This technique uses error measure additivity, pre-computation and
localized updates to derive very accurate summaries. These optimization features
allow variance constraints to be used scalably to summarize very large data sets,
providing a significant improvement over previous strategies. The experimental
results show the scalability and improved accuracy of the technique.




Figure 3 - Histogram Construction Cost

[Figure: x-axis categories are combinations of Product, Family, Segment and Customer aggregation levels]

Figure 4 - Comparative Accuracy of vmhist and mhist